Patient Appointments No-Show Prediction and Analysis
Index
- Introduction
- Objective
- Project Architecture
- Dataset
- Data Cleaning and Preprocessing
- Exploratory Data Analysis
- Model Selection
- Deployment
- Conclusion
- Recommendations
- Next Steps
- References
This project focuses on predicting and analyzing patient appointment no-shows using a combination of supervised learning, unsupervised clustering, and natural language processing (NLP) techniques. By leveraging structured data and clinical notes, the project aims to identify key factors influencing patient attendance, understand patient profiles, and extract insights from sentiment and topic modeling. The ultimate goal is to improve healthcare scheduling, reduce missed appointments, and enhance patient care through data-driven decision making.
Problem Statement
Missed medical appointments, commonly referred to as "no-shows," present a significant challenge for healthcare systems worldwide. No-shows lead to inefficient use of resources, increased operational costs, and can negatively impact patient health outcomes due to delayed care. The problem is further complicated by the diverse factors influencing patient attendance, including demographic, clinical, behavioral, and emotional aspects, as well as information embedded in unstructured clinical notes.
This project aims to develop a comprehensive, data-driven solution to predict and analyze patient appointment no-shows. By leveraging structured data, unsupervised clustering, and advanced natural language processing (NLP) techniques on clinical notes, the project seeks to:
- Identify key factors influencing patient appointment no-shows using structured and unstructured data.
- Predict patient attendance using supervised machine learning models.
- Understand patient profiles and groupings through unsupervised clustering techniques.
- Analyze patient sentiment and emotions from clinical notes using NLP.
- Extract and interpret topics from patient notes to uncover underlying reasons for no-shows.
- Provide actionable insights from both structured and unstructured data to inform interventions, improve scheduling efficiency, and reduce missed appointments.
- Enhance patient care and operational efficiency through data-driven decision-making.
The ultimate goal is to reduce missed appointments, optimize resource allocation, and enhance patient care through predictive analytics and explainable insights.
The project architecture is designed to support a modular, end-to-end workflow for predicting and analyzing patient appointment no-shows. It consists of the following main components:
- Data Ingestion & Preprocessing: Handles loading, cleaning, and transforming raw data using the DataPreprocessor class, ensuring data quality for downstream tasks.
- Exploratory Data Analysis (EDA): Uses the PlotGenerator class to visualize distributions, correlations, and key patterns in the dataset.
- Supervised Learning: Implements multiple machine learning models (Logistic Regression, Random Forest, XGBoost) to predict patient no-shows, with support for feature engineering, hyperparameter tuning, and evaluation.
- Unsupervised Learning: Applies dimensionality reduction (PCA) and clustering (K-Means, GMM) via the ClusteringAnalysis class to uncover patient profiles and groupings.
- Natural Language Processing (NLP):
  - Sentiment Analysis: Uses the SentimentAnalysisModel to extract and predict emotional states from patient notes.
  - Topic Modeling: Employs the ClinicalTopicModel to identify key topics and reasons for no-shows from clinical text.
- Visualization & Reporting: Centralized plotting and reporting functions for model results, insights, and interpretability.
- Deployment & Export: Supports model export and configuration management for deployment and reproducibility.
- Streamlit Integration: Provides a user-friendly web interface for interactive model inference, visualization, and reporting, enabling stakeholders to explore predictions and insights in real time.
- Configuration Management: All project-wide configuration variables (such as file paths, model parameters, and feature lists) are centralized in config.py for easy maintenance and reproducibility. Additionally, a .env file manages step-wise execution flags and environment-specific settings, so developers can enable or disable individual pipeline steps during development.
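The step-wise execution flags described under Configuration Management could be read along the following lines. This is only a minimal sketch: the flag names (RUN_EDA, RUN_SUPERVISED, RUN_CLUSTERING), the helper env_flag, and the PIPELINE_STEPS dictionary are hypothetical and are not taken from the project's config.py or .env file. It also assumes the .env values have already been loaded into the process environment, for example by a loader such as python-dotenv.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment; an unset variable yields `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Hypothetical step flags; the real names in the project's .env may differ.
PIPELINE_STEPS = {
    "eda": env_flag("RUN_EDA", default=True),
    "supervised": env_flag("RUN_SUPERVISED", default=True),
    "clustering": env_flag("RUN_CLUSTERING", default=False),
}


def enabled_steps(steps: dict) -> list:
    """Return the names of the steps whose flag is switched on, in declaration order."""
    return [name for name, is_on in steps.items() if is_on]


if __name__ == "__main__":
    print("Steps to run:", enabled_steps(PIPELINE_STEPS))
```

Keeping the flag parsing in one helper means every step interprets values such as "true", "1", or "on" the same way, and a missing variable falls back to a documented default instead of raising an error.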
This modular structure enables flexible experimentation, easy maintenance, and scalability for future enhancements.
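To make the unsupervised component more concrete, the sketch below chains scaling, PCA, K-Means, and a Gaussian mixture model using scikit-learn. It is an illustration under stated assumptions and not the project's ClusteringAnalysis implementation: the data is synthetic (make_blobs stands in for numeric patient features), and the choices of two principal components and three clusters are arbitrary placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric patient features (not the project's dataset).
X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=42)

# Standardize first so that no single feature dominates the principal components.
X_scaled = StandardScaler().fit_transform(X)

# PCA reduces dimensionality; it does not assign clusters by itself.
pca = PCA(n_components=2, random_state=42)
X_reduced = pca.fit_transform(X_scaled)

# K-Means gives hard assignments: each sample belongs to exactly one cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_reduced)

# GMM gives soft assignments: one membership probability per component.
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X_reduced)
gmm_probs = gmm.predict_proba(X_reduced)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("K-Means cluster sizes:", np.bincount(kmeans_labels))
print("GMM cluster sizes:", np.bincount(gmm_labels, minlength=3))
```

The soft probabilities from the mixture model can be useful when a patient profile sits between two groups, whereas K-Means always commits to a single label.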
%pip install -r ../requirements.txt
%load_ext autoreload
%autoreload 2
applications\final team project\aai510_3proj\.venv\lib\site-packages (from ipython>=6.1.0->ipywidgets>=7.6.0->-r ../requirements.txt (line 25)) (0.19.2) Requirement already satisfied: jsonschema-specifications>=2023.03.6 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (2025.4.1) Requirement already satisfied: rpds-py>=0.7.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (0.25.1) Requirement already satisfied: attrs>=22.2.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (25.3.0) Requirement already satisfied: referencing>=0.28.4 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (0.36.2) Requirement already satisfied: pywinpty>=2.0.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (2.0.15) Requirement already satisfied: argon2-cffi>=21.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (25.1.0) Requirement already satisfied: jupyter-server-terminals>=0.4.4 in d:\personal\ai-admissions\semester 3\aai-510 - machine 
learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (0.5.3) Requirement already satisfied: jupyter-client>=7.4.4 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (8.6.3) Requirement already satisfied: send2trash>=1.8.2 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (1.8.3) Requirement already satisfied: overrides>=5.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (7.7.0) Requirement already satisfied: websocket-client>=1.7 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (1.8.0) Requirement already satisfied: jupyter-events>=0.11.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (0.12.0) Requirement already satisfied: pyzmq>=24 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (27.0.0) Requirement already satisfied: 
jupyter-core!=5.0.*,>=4.12 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (5.8.1) Requirement already satisfied: anyio>=3.1.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (4.9.0) Requirement already satisfied: nbformat>=5.3.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (5.10.4) Requirement already satisfied: terminado>=0.8.3 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (0.18.1) Requirement already satisfied: prometheus-client>=0.9 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (0.22.1) Requirement already satisfied: MarkupSafe>=2.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jinja2->torch>=1.10.0->-r ../requirements.txt (line 10)) (3.0.2) Requirement already satisfied: async-lru>=1.0.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyterlab->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (2.0.5) 
Requirement already satisfied: jupyter-lsp>=2.0.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyterlab->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (2.2.5) Requirement already satisfied: httpx>=0.25.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyterlab->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (0.28.1) Requirement already satisfied: debugpy>=1.6.5 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from ipykernel->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (1.8.14) Requirement already satisfied: nest-asyncio in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from ipykernel->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (1.6.0) Requirement already satisfied: json5>=0.9.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyterlab-server<3,>=2.27.1->notebook>=6.4.0->-r ../requirements.txt (line 26)) (0.12.0) Requirement already satisfied: babel>=2.10 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyterlab-server<3,>=2.27.1->notebook>=6.4.0->-r ../requirements.txt (line 26)) (2.17.0) Requirement already satisfied: language-data>=1.2 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from langcodes<4.0.0,>=3.2.0->spacy>=3.0.0->-r ../requirements.txt (line 30)) (1.3.0) Requirement already 
satisfied: nbclient>=0.5.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from nbconvert->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (0.10.2) Requirement already satisfied: mistune<4,>=2.0.3 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from nbconvert->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (3.1.3) Requirement already satisfied: pandocfilters>=1.4.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from nbconvert->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (1.5.1) Requirement already satisfied: defusedxml in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from nbconvert->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (0.7.1) Requirement already satisfied: jupyterlab-pygments in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from nbconvert->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (0.3.0) Requirement already satisfied: bleach[css]!=5.0.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from nbconvert->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (6.2.0) Requirement already satisfied: annotated-types>=0.6.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy>=3.0.0->-r ../requirements.txt (line 30)) (0.7.0) Requirement already satisfied: 
typing-inspection>=0.4.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy>=3.0.0->-r ../requirements.txt (line 30)) (0.4.1) Requirement already satisfied: pydantic-core==2.33.2 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4->spacy>=3.0.0->-r ../requirements.txt (line 30)) (2.33.2) Requirement already satisfied: quicksectx>=0.3.5 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from PyRuSH>=1.0.8->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (0.4.0) Requirement already satisfied: Cython in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from PyRuSH>=1.0.8->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (3.0.11) Requirement already satisfied: PyFastNER>=1.0.8 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from PyRuSH>=1.0.8->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (1.0.10) Requirement already satisfied: certifi>=2017.4.17 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from requests->transformers>=4.20.0->-r ../requirements.txt (line 9)) (2025.4.26) Requirement already satisfied: charset_normalizer<4,>=2 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from requests->transformers>=4.20.0->-r ../requirements.txt (line 9)) (3.4.2) 
Requirement already satisfied: urllib3<3,>=1.21.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from requests->transformers>=4.20.0->-r ../requirements.txt (line 9)) (2.4.0) Requirement already satisfied: idna<4,>=2.5 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from requests->transformers>=4.20.0->-r ../requirements.txt (line 9)) (3.10) Requirement already satisfied: mpmath<1.4,>=1.1.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from sympy>=1.13.3->torch>=1.10.0->-r ../requirements.txt (line 10)) (1.3.0) Requirement already satisfied: blis<0.8.0,>=0.7.8 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from thinc<8.3.0,>=8.2.2->spacy>=3.0.0->-r ../requirements.txt (line 30)) (0.7.11) Requirement already satisfied: confection<1.0.0,>=0.0.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from thinc<8.3.0,>=8.2.2->spacy>=3.0.0->-r ../requirements.txt (line 30)) (0.1.5) Requirement already satisfied: shellingham>=1.3.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from typer<1.0.0,>=0.3.0->spacy>=3.0.0->-r ../requirements.txt (line 30)) (1.5.4) Requirement already satisfied: rich>=10.11.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from typer<1.0.0,>=0.3.0->spacy>=3.0.0->-r ../requirements.txt (line 30)) (14.0.0) Requirement 
already satisfied: cloudpathlib<1.0.0,>=0.7.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from weasel<0.5.0,>=0.1.0->spacy>=3.0.0->-r ../requirements.txt (line 30)) (0.21.1) Requirement already satisfied: smart-open<8.0.0,>=5.2.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from weasel<0.5.0,>=0.1.0->spacy>=3.0.0->-r ../requirements.txt (line 30)) (7.1.0) Requirement already satisfied: aiosignal>=1.1.2 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec->torch>=1.10.0->-r ../requirements.txt (line 10)) (1.3.2) Requirement already satisfied: frozenlist>=1.1.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec->torch>=1.10.0->-r ../requirements.txt (line 10)) (1.7.0) Requirement already satisfied: multidict<7.0,>=4.5 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec->torch>=1.10.0->-r ../requirements.txt (line 10)) (6.4.4) Requirement already satisfied: yarl<2.0,>=1.17.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec->torch>=1.10.0->-r ../requirements.txt (line 10)) (1.20.1) Requirement already satisfied: async-timeout<6.0,>=4.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from 
aiohttp!=4.0.0a0,!=4.0.0a1->fsspec->torch>=1.10.0->-r ../requirements.txt (line 10)) (5.0.1) Requirement already satisfied: aiohappyeyeballs>=2.5.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec->torch>=1.10.0->-r ../requirements.txt (line 10)) (2.6.1) Requirement already satisfied: propcache>=0.2.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec->torch>=1.10.0->-r ../requirements.txt (line 10)) (0.3.2) Requirement already satisfied: sniffio>=1.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from anyio>=3.1.0->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (1.3.1) Requirement already satisfied: argon2-cffi-bindings in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from argon2-cffi>=21.1->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (21.2.0) Requirement already satisfied: webencodings in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from bleach[css]!=5.0.0->nbconvert->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (0.5.1) Requirement already satisfied: tinycss2<1.5,>=1.1.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from bleach[css]!=5.0.0->nbconvert->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (1.4.0) Requirement already satisfied: smmap<6,>=3.0.1 in d:\personal\ai-admissions\semester 
3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from gitdb<5,>=4.0.1->gitpython!=3.1.19,<4,>=3.0.7->streamlit>=1.10.0->-r ../requirements.txt (line 11)) (5.0.2) Requirement already satisfied: httpcore==1.* in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from httpx>=0.25.0->jupyterlab->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (1.0.9) Requirement already satisfied: h11>=0.16 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from httpcore==1.*->httpx>=0.25.0->jupyterlab->jupyter>=1.0.0->-r ../requirements.txt (line 24)) (0.16.0) Requirement already satisfied: parso<0.9.0,>=0.8.4 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jedi>=0.16->ipython>=6.1.0->ipywidgets>=7.6.0->-r ../requirements.txt (line 25)) (0.8.4) Requirement already satisfied: pywin32>=300 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-core!=5.0.*,>=4.12->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (310) Requirement already satisfied: platformdirs>=2.5 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-core!=5.0.*,>=4.12->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (4.3.8) Requirement already satisfied: rfc3339-validator in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from 
jupyter-events>=0.11.0->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (0.1.4) Requirement already satisfied: python-json-logger>=2.0.4 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-events>=0.11.0->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (3.3.0) Requirement already satisfied: rfc3986-validator>=0.1.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jupyter-events>=0.11.0->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (0.1.1) Requirement already satisfied: marisa-trie>=1.1.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from language-data>=1.2->langcodes<4.0.0,>=3.2.0->spacy>=3.0.0->-r ../requirements.txt (line 30)) (1.2.1) Requirement already satisfied: fastjsonschema>=2.15 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from nbformat>=5.3.0->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (2.21.1) Requirement already satisfied: wcwidth in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython>=6.1.0->ipywidgets>=7.6.0->-r ../requirements.txt (line 25)) (0.2.13) Requirement already satisfied: markdown-it-py>=2.2.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy>=3.0.0->-r ../requirements.txt (line 30)) (3.0.0) 
Requirement already satisfied: wrapt in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from smart-open<8.0.0,>=5.2.1->weasel<0.5.0,>=0.1.0->spacy>=3.0.0->-r ../requirements.txt (line 30)) (1.17.2) Requirement already satisfied: executing>=1.2.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from stack_data->ipython>=6.1.0->ipywidgets>=7.6.0->-r ../requirements.txt (line 25)) (2.2.0) Requirement already satisfied: pure-eval in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from stack_data->ipython>=6.1.0->ipywidgets>=7.6.0->-r ../requirements.txt (line 25)) (0.2.3) Requirement already satisfied: asttokens>=2.1.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from stack_data->ipython>=6.1.0->ipywidgets>=7.6.0->-r ../requirements.txt (line 25)) (3.0.0) Requirement already satisfied: webcolors>=24.6.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (24.11.1) Requirement already satisfied: jsonpointer>1.13 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (3.0.0) Requirement already satisfied: isoduration in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jsonschema->medspacy>=1.0.0->-r ../requirements.txt 
(line 29)) (20.11.0) Requirement already satisfied: uri-template in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (1.3.0) Requirement already satisfied: fqdn in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (1.5.1) Requirement already satisfied: mdurl~=0.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer<1.0.0,>=0.3.0->spacy>=3.0.0->-r ../requirements.txt (line 30)) (0.1.2) Requirement already satisfied: cffi>=1.0.1 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from argon2-cffi-bindings->argon2-cffi>=21.1->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (1.17.1) Requirement already satisfied: pycparser in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi>=21.1->jupyter-server<3,>=2.4.0->notebook>=6.4.0->-r ../requirements.txt (line 26)) (2.22) Requirement already satisfied: arrow>=0.15.0 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final team project\aai510_3proj\.venv\lib\site-packages (from isoduration->jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (1.3.0) Requirement already satisfied: types-python-dateutil>=2.8.10 in d:\personal\ai-admissions\semester 3\aai-510 - machine learning fundamentals and applications\final 
team project\aai510_3proj\.venv\lib\site-packages (from arrow>=0.15.0->isoduration->jsonschema->medspacy>=1.0.0->-r ../requirements.txt (line 29)) (2.9.0.20250516) The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
[notice] A new release of pip available: 22.2.1 -> 25.1.1 [notice] To update, run: python.exe -m pip install --upgrade pip
# Import necessary libraries
import sys
import os
import shutil
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
sys.path.append(os.path.abspath(os.path.join(os.pardir, 'src')))
# Import project-specific internal modules
from preprocessor import DataPreprocessor
from src.plots import PlotGenerator
from src import config
from config import (
    RUN_CONFIGURATION, EMOTION_STATES, NLP_CONFIG, SENTIMENT_MODEL_EXPORT_PATH_RAW,
    SENTIMENT_MODEL_EXPORT_PATH_OPTIMIZED, EMOTION_VARIATIONS_PATH, NEGATION_PATTERNS_PATH,
    HYPERPARAMETERS, RANDOM_STATE, PREDICTION_MODEL_EXPORT_PATH, TOPIC_MODEL_EXPORT_PATH,
    is_step_enabled,
)
# Unsupervised learning imports
from clustering import ClusteringAnalysis
# Supervised learning imports
# NLP imports
from src.sentiment_analysis import SentimentAnalysisModel
from src.emotion_postprocessor import EmotionPostProcessor
from src.clinical_notes_prediction import ClinicalNotesNoShowPredictor
from clinical_topic_model import ClinicalTopicModel
# Create an instance of the preprocessing and plotting classes
preprocessor = DataPreprocessor(config)
plotter = PlotGenerator(style='whitegrid', palette='viridis', figsize=(10, 6))
sns.set(style='whitegrid')
warnings.filterwarnings("ignore")
The dataset used in this project is sourced from Kaggle: No-show appointments. It contains information about medical appointments in Brazil and whether patients showed up for their scheduled appointments. The dataset includes the following features:
- PatientId: Unique identifier for each patient.
- AppointmentID: Unique identifier for each appointment.
- Gender: Patient's gender (Male or Female). Females make up the larger share of appointments, which may reflect higher healthcare engagement.
- AppointmentDay: Date of the actual appointment (DataMarcacaoConsulta in the original Portuguese naming).
- ScheduledDay: Date and time when the appointment was booked (DataAgendamento in the original Portuguese naming).
- Age: Patient's age.
- Neighbourhood: Location where the appointment takes place.
- Scholarship: Indicates if the patient is enrolled in the Bolsa Família welfare program.
- Hypertension: Whether the patient has hypertension (True/False).
- Diabetes: Whether the patient has diabetes (True/False).
- Alcoholism: Whether the patient has alcoholism (True/False).
- Handcap: Handicap indicator. The summary statistics below show values from 0 to 4, so it is a level rather than a strict True/False flag.
- SMS_received: Whether an SMS reminder was sent to the patient (0 or 1 in the loaded data).
- No-show: Indicates if the patient missed the appointment (Yes/No in the raw data, encoded as 1/0 after preprocessing).
This dataset enables analysis of demographic, clinical, and behavioral factors influencing patient attendance, supporting predictive modeling and healthcare insights.
Additional Columns: PatientNotes, PatientSentiment, and NoShowReason
To enrich the dataset with realistic unstructured data, additional columns - PatientNotes, PatientSentiment, and NoShowReason - were generated using custom simulation rules implemented in datasimulator.py. The simulation process applied the following logic:
- PatientNotes: Synthetic clinical notes were generated by combining demographic, appointment, and health condition information with randomly selected phrases that reflect common patient experiences, concerns, or behaviors (e.g., anxiety about procedures, confusion over instructions, or positive engagement).
- PatientSentiment: Sentiment labels were assigned based on keywords and patterns detected in the simulated notes, mapping text to emotional states such as anxiety, stress, hopefulness, or fear, using predefined vocabularies and context-aware rules.
- NoShowReason: For patients marked as no-shows, plausible reasons were simulated by considering factors like age, comorbidities, appointment timing, and prior attendance history, assigning reasons such as "transportation issues," "forgot appointment," or "felt unwell."
These custom rules ensure that the simulated columns provide diverse, contextually relevant, and analytically valuable data for downstream NLP and predictive modeling tasks.
if is_step_enabled('dataload'):
    df = preprocessor.load_data(config.DATASET_PATH)
    display("shape:", df.shape)
    display("columns:", df.columns)
    display(df.head())
    display(df.describe())
'shape:'
(110527, 17)
'columns:'
Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hypertension',
'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show',
'PatientNotes', 'PatientSentiment', 'NoShowReason'],
dtype='object')
| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | PatientNotes | PatientSentiment | NoShowReason |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No | Patient with poorly controlled hypertension (s... | Patient is worried about long-term effects of ... | Positive experiences with clinic staff, such a... |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No | Elderly patient. Discussed fall prevention str... | Confusion about insurance coverage and billing... | A clear understanding of their health status, ... |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No | Patient with hypertension is following a low-s... | Anxiety and confusion about diabetes care cont... | A clear understanding of their health status, ... |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No | Child accompanied by parent/guardian. Reviewed... | Patient is worried about memory loss and manag... | The patient is committed to managing chronic c... |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No | The patient is managing type 2 diabetes with M... | Fear of medication side effects and doubts abo... | The patient prioritizes following medical advi... |
| | PatientId | AppointmentID | Age | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.105270e+05 | 1.105270e+05 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 |
| mean | 1.474963e+14 | 5.675305e+06 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 |
| std | 2.560949e+14 | 7.129575e+04 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 3.921784e+04 | 5.030230e+06 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.172614e+12 | 5.640286e+06 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 3.173184e+13 | 5.680573e+06 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 9.439172e+13 | 5.725524e+06 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 9.999816e+14 | 5.790484e+06 | 115.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
The preprocess_data method of the DataPreprocessor class cleans and prepares the raw dataset for analysis and modeling.
The steps below are typical for such a method; see src/preprocessor.py for the exact details. The log printed by the next cell explicitly reports column dropping, date conversion, missing-value handling, and the addition of emotional-state columns.
- Handling Missing Values: Identifies and fills or removes missing values in key columns to ensure data integrity.
- Data Type Conversion: Converts columns to appropriate data types (e.g., dates to datetime, categorical variables to category type, numeric columns to float/int).
- Feature Engineering: Creates new features or transforms existing ones (e.g., extracting appointment lead time, encoding categorical variables, generating binary flags).
- Outlier Detection and Removal: Detects and handles outliers in numerical columns (such as Age) to reduce skewness and improve model performance.
- Standardization and Normalization: Scales numerical features if required for downstream modeling.
- Text Cleaning: Cleans text columns (like PatientNotes) by removing special characters, lowercasing, and handling typos or irrelevant tokens.
- Consistency Checks: Ensures consistent formatting (e.g., standardizing Yes/No or True/False values, fixing inconsistent labels).
- Final Sanity Checks: Verifies the shape, column names, and summary statistics of the cleaned DataFrame.
The cleaned DataFrame is then returned for further analysis and modeling.
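The date conversion, encoding, lead-time, and emotion-flag steps can be sketched as follows. This is a minimal illustration, not the project's preprocess_data implementation: the function name preprocess_sketch and the EMOTION_KEYWORDS map are hypothetical, and the real keyword vocabulary may differ. Because AppointmentDay carries no time of day, the sketch compares calendar dates so that a same-day booking yields a lead time of 0.

```python
import pandas as pd

# Hypothetical keyword map for illustration only; the project's actual
# vocabulary is defined elsewhere and is not reproduced here.
EMOTION_KEYWORDS = {
    "anxiety": ("anxiety", "anxious"),
    "stress": ("stress", "overwhelmed"),
    "confusion": ("confusion", "confused"),
    "hopeful": ("hopeful", "optimistic"),
    "fear": ("fear", "afraid"),
}


def preprocess_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass over a no-show style DataFrame."""
    # Drop identifier columns that carry no predictive signal.
    df = raw.drop(columns=["PatientId", "AppointmentID"], errors="ignore").copy()

    # Parse the ISO timestamps into timezone-aware datetimes.
    for col in ("ScheduledDay", "AppointmentDay"):
        df[col] = pd.to_datetime(df[col], utc=True)

    # Encode the binary categorical columns as 0/1.
    df["Gender"] = (df["Gender"] == "M").astype(int)
    df["No-show"] = (df["No-show"] == "Yes").astype(int)

    # Lead time in whole days, computed on calendar dates (midnight-normalized).
    df["WaitDays"] = (
        df["AppointmentDay"].dt.normalize() - df["ScheduledDay"].dt.normalize()
    ).dt.days

    # One binary flag per emotion, set when any keyword appears in the text.
    text = df["PatientSentiment"].fillna("").str.lower()
    for emotion, words in EMOTION_KEYWORDS.items():
        df[emotion] = text.str.contains("|".join(words), regex=True).astype(int)
    return df


sample = pd.DataFrame({
    "PatientId": [1.0, 2.0],
    "AppointmentID": [10, 11],
    "Gender": ["F", "M"],
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:00:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-02T00:00:00Z"],
    "No-show": ["No", "Yes"],
    "PatientSentiment": [
        "Anxiety about an upcoming procedure",
        "Confusion and fear about billing",
    ],
})
clean = preprocess_sketch(sample)
print(clean[["Gender", "No-show", "WaitDays", "anxiety", "confusion", "fear"]])
```

Subtracting the raw timestamps instead of the normalized dates would floor a same-day booking made at 18:38 to -1 days, which is worth checking against any negative WaitDays values.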
if is_step_enabled('data_preprocess'):
    df = preprocessor.preprocess_data(df)
    display("shape:", df.shape)
    display("columns:", df.columns)
    display(df.head())
    display(df.describe())
[preprocessing] Starting preprocessing...
Initial shape of the dataset: (110527, 17)
Initial columns in the dataset: Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hypertension',
'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show',
'PatientNotes', 'PatientSentiment', 'NoShowReason'],
dtype='object')
Dropping unnecessary columns...
Remaining columns: Index(['Gender', 'ScheduledDay', 'AppointmentDay', 'Age', 'Neighbourhood',
'Scholarship', 'Hypertension', 'Diabetes', 'Alcoholism', 'Handcap',
'SMS_received', 'No-show', 'PatientNotes', 'PatientSentiment',
'NoShowReason'],
dtype='object')
Converting date columns to datetime...
Handling missing values...
Adding emotional state columns...
Emotional state columns added: ['anxiety', 'stress', 'confusion', 'hopeful', 'fear']
Final shape of the dataset: (110527, 21)
Final columns in the dataset: Index(['Gender', 'ScheduledDay', 'AppointmentDay', 'Age', 'Neighbourhood',
'Scholarship', 'Hypertension', 'Diabetes', 'Alcoholism', 'Handcap',
'SMS_received', 'No-show', 'PatientNotes', 'PatientSentiment',
'NoShowReason', 'WaitDays', 'anxiety', 'stress', 'confusion', 'hopeful',
'fear'],
dtype='object')
[preprocessing] Preprocessing complete.
'shape:'
(110527, 21)
'columns:'
Index(['Gender', 'ScheduledDay', 'AppointmentDay', 'Age', 'Neighbourhood',
'Scholarship', 'Hypertension', 'Diabetes', 'Alcoholism', 'Handcap',
'SMS_received', 'No-show', 'PatientNotes', 'PatientSentiment',
'NoShowReason', 'WaitDays', 'anxiety', 'stress', 'confusion', 'hopeful',
'fear'],
dtype='object')
| | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | ... | No-show | PatientNotes | PatientSentiment | NoShowReason | WaitDays | anxiety | stress | confusion | hopeful | fear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62.0 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | ... | 0 | Patient with poorly controlled hypertension (s... | Patient is worried about long-term effects of ... | Positive experiences with clinic staff, such a... | -1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 2016-04-29 16:08:27+00:00 | 2016-04-29 00:00:00+00:00 | 56.0 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | ... | 0 | Elderly patient. Discussed fall prevention str... | Confusion about insurance coverage and billing... | A clear understanding of their health status, ... | -1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 2016-04-29 16:19:04+00:00 | 2016-04-29 00:00:00+00:00 | 62.0 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | ... | 0 | Patient with hypertension is following a low-s... | Anxiety and confusion about diabetes care cont... | A clear understanding of their health status, ... | -1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 2016-04-29 17:29:31+00:00 | 2016-04-29 00:00:00+00:00 | 8.0 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | ... | 0 | Child accompanied by parent/guardian. Reviewed... | Patient is worried about memory loss and manag... | The patient is committed to managing chronic c... | -1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 2016-04-29 16:07:23+00:00 | 2016-04-29 00:00:00+00:00 | 56.0 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | ... | 0 | The patient is managing type 2 diabetes with M... | Fear of medication side effects and doubts abo... | The patient prioritizes following medical advi... | -1 | 0 | 0 | 0 | 0 | 1 |
5 rows × 21 columns
| | Gender | Age | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | WaitDays | anxiety | stress | confusion | hopeful | fear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 |
| mean | 0.350023 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 | 0.201933 | 9.183702 | 0.491979 | 0.603273 | 0.532594 | 0.106336 | 0.426412 |
| std | 0.476979 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 | 0.401444 | 15.254996 | 0.499938 | 0.489221 | 0.498939 | 0.308269 | 0.494557 |
| min | 0.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 14.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 |
| max | 1.000000 | 115.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 | 1.000000 | 178.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
The Exploratory Data Analysis (EDA) section generates several key plots to help understand the dataset:
Age Distribution Histogram:
- Shows the distribution of patient ages in the dataset.
- Helps identify age groups with higher or lower representation and spot any outliers.
No-show vs Show Countplot:
- Visualizes the count of appointments that patients attended versus those they missed.
- Useful for understanding class imbalance in the target variable.
Correlation Heatmap:
- Displays the correlation coefficients between numerical features.
- Helps identify relationships between variables, such as which features are strongly correlated with each other or with the target.
Emotional States Bar Plot:
- Shows the frequency of different emotional states detected in patient sentiment data.
- Provides insight into the emotional landscape of the patient population.
Word Clouds:
- Generated for PatientSentiment, PatientNotes, and NoShowReason columns.
- Highlight the most common words and themes in each text field, offering a qualitative overview of patient concerns and reasons for no-shows.
if is_step_enabled('eda'):
    # Distribution of Age - Using class-based approach
    plotter.plot_histplot(
        data=df,
        column='Age',
        bins=30,
        kde=True,
        title='Age Distribution',
        xlabel='Age',
        ylabel='Frequency',
        figsize=(10, 6)
    )
    # Countplot of No-show vs Show
    plotter.plot_countplot(
        data=df,
        column='No-show',
        title='Count of No-show vs Show',
        xlabel='No-show',
        ylabel='Count',
        figsize=(8, 5)
    )
    # Correlation heatmap
    numeric_df = df.select_dtypes(include=[np.number])
    correlation_matrix = numeric_df.corr()
    plotter.plot_heatmap(
        data=correlation_matrix,
        title='Correlation Heatmap',
        fmt='.2f',
        cmap='coolwarm',
        square=True,
        figsize=(12, 8)
    )
- The age distribution peaks in early childhood and declines with age. The summary statistics put the median at 37 and the 75th percentile at 55, so most patients are under 60.
- Apart from the early-childhood peak, the bulk of the distribution is fairly symmetric (mean 37.1, median 37), with a thin right tail of older patients up to age 115. The minimum Age of -1 is an invalid value that is still present after preprocessing.
- The count plot shows that most patients attended their appointments (No-show=0), while about 20% missed them (No-show=1; the column mean is 0.202).
- This indicates a class imbalance, which is important to consider for predictive modeling.
- The correlation heatmap shows weak to moderate relationships among most features, with few strong correlations (e.g., Age-Hypertension, Hypertension-Diabetes).
- Most variables are largely independent, suggesting minimal multicollinearity and diverse feature contributions for modeling.
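One common way to account for the class imbalance noted above is to measure the class shares and preserve them when splitting the data. The sketch below uses a small synthetic frame (the demo DataFrame and its 20% positive rate are assumptions for illustration, not the project data) together with the stratify option of train_test_split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data, with roughly 20% no-shows.
rng = np.random.default_rng(42)
n = 1000
demo = pd.DataFrame({
    "Age": rng.integers(0, 90, size=n),
    "SMS_received": rng.integers(0, 2, size=n),
    "No-show": (rng.random(n) < 0.2).astype(int),
})

# Share of each class in the target column.
class_share = demo["No-show"].value_counts(normalize=True)
print(class_share)

# A stratified split keeps the no-show rate nearly identical in both parts.
X = demo.drop(columns="No-show")
y = demo["No-show"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"train no-show rate: {y_train.mean():.3f}")
print(f"test no-show rate:  {y_test.mean():.3f}")
```

With an imbalanced target, accuracy alone can look good for a model that always predicts attendance, which is why precision, recall, and ROC-AUC are usually reported alongside it.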
if is_step_enabled('eda'):
    # Plot emotional states as a bar plot - Using class method
    plotter.plot_emotional_states_bar(df)
    # Plot word clouds for PatientSentiment, PatientNotes, and NoShowReason
    plotter.plot_text_wordcloud_custom_stopwords(df['PatientSentiment'], title='Patient Sentiment Word Cloud')
    plotter.plot_text_wordcloud_custom_stopwords(df['PatientNotes'], title='Patient Notes Word Cloud')
    plotter.plot_text_wordcloud_custom_stopwords(df['NoShowReason'], title='No-Show Reason Word Cloud')
Total unique words after filtering: 161
Top 10 most frequent words: {'confusion': 77779, 'anxiety': 62099, 'fear': 57339, 'stress': 55610, 'feels': 54774, 'health': 50047, 'blood': 35628, 'expresses': 34563, 'missed': 33931, 'alcohol': 29663}
Total unique words after filtering: 480
Top 10 most frequent words: {'routine': 75106, 'hypertension': 69899, 'health': 62876, 'provided': 61365, 'discussed': 56894, 'support': 54154, 'importance': 45468, 'strategies': 43443, 'addressed': 43435, 'prevention': 43424}
Total unique words after filtering: 477
Top 10 most frequent words: {'health': 35477, 'follow': 21074, 'including': 19948, 'attend': 15422, 'regular': 15302, 'family': 13660, 'recent': 13186, 'healthcare': 11945, 'especially': 11621, 'scheduled': 11592}
- The bar plot shows that stress, confusion, and anxiety are the most common emotional states among patients, with stress the highest (flagged in about 60% of records, versus 53% for confusion and 49% for anxiety).
- Hopeful sentiment is much less frequent (about 11% of records), indicating a predominance of negative emotions in patient sentiment data.
- The PatientSentiment word cloud highlights that "confusion," "anxiety," "fear," and "stress" are the most frequently expressed sentiments among patients. This suggests that negative emotions dominate patient feedback, indicating areas for targeted intervention.
- The PatientNotes word cloud shows that "routine", "hypertension", "health", and "provided" are the most frequently mentioned terms, highlighting a focus on regular care and chronic condition management.
- The NoShowReason word cloud shows that "health", "follow", "including", and "attend" are the most frequent terms. These are general vocabulary rather than distinct reasons, but they point to health-related factors and follow-up as recurring themes in patient attendance.
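The "unique words" and "top 10" figures printed above come from the plotting helper. A rough equivalent can be sketched with the standard library, assuming simple lowercase tokenization and a custom stopword set. The top_words function and the STOPWORDS set are hypothetical and may not match the helper's actual filtering.

```python
import re
from collections import Counter

# Hypothetical stopword set for illustration.
STOPWORDS = {"the", "a", "an", "and", "of", "about", "is", "to", "patient"}


def top_words(texts, stopwords=STOPWORDS, n=10):
    """Return (number of unique kept words, the n most frequent with counts)."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", str(text).lower())
        # Drop stopwords and very short tokens.
        counts.update(t for t in tokens if t not in stopwords and len(t) > 2)
    return len(counts), dict(counts.most_common(n))


notes = [
    "Patient feels anxiety about blood tests",
    "Fear and anxiety about the procedure",
    "Patient expresses confusion about billing",
]
unique_count, top = top_words(notes, n=3)
print(f"Total unique words after filtering: {unique_count}")
print(f"Top 3 most frequent words: {top}")
```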
This project employs a combination of supervised, unsupervised, and natural language processing (NLP) models to predict and analyze patient appointment no-shows.
Supervised Learning Models¶
Logistic Regression
Logistic Regression is a statistical model used for binary classification tasks. It estimates the probability that a given input belongs to a particular category by applying the logistic (sigmoid) function to a linear combination of input features (Hosmer et al., 2013).
Justification: Logistic Regression is interpretable and effective for baseline binary classification, making it suitable for predicting no-show events based on structured patient data.
Random Forest
Random Forest is an ensemble learning method that constructs multiple decision trees during training and, for classification, outputs the mode of their predictions (Breiman, 2001). It handles non-linear relationships and interactions between features well.
Justification: Random Forest is less prone to overfitting than a single decision tree and can capture complex patterns in the data, which is valuable for healthcare datasets with mixed feature types.
Gradient Boosting (XGBoost)
XGBoost is an optimized implementation of gradient boosting machines, which sequentially build decision trees to correct errors made by previous trees (Chen & Guestrin, 2016).
Justification: XGBoost is known for its high predictive performance and efficiency, making it suitable for structured data with potential feature interactions.
Unsupervised Learning Models¶
Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that transforms correlated features into a set of linearly uncorrelated components, capturing the maximum variance in the data (Jolliffe & Cadima, 2016).
Justification: PCA helps visualize and reduce the complexity of high-dimensional patient data, facilitating clustering and interpretation.
K-Means Clustering
K-Means is an unsupervised algorithm that partitions data into k clusters by minimizing the within-cluster sum of squares (MacQueen, 1967).
Justification: K-Means is efficient for grouping patients with similar profiles, aiding in understanding patient segments and tailoring interventions.
Gaussian Mixture Model (GMM)
GMM is a probabilistic model that assumes data is generated from a mixture of several Gaussian distributions (Reynolds, 2009).
Justification: GMM provides soft clustering and can model clusters of different shapes and sizes, which is useful for heterogeneous patient populations.
Natural Language Processing (NLP) Models¶
Sentiment Analysis Model
A custom neural network-based sentiment analysis model is used to detect emotional states in patient notes, leveraging domain-specific vocabularies and context-aware rules (Cambria et al., 2017).
Justification: Understanding patient sentiment provides insights into behavioral factors influencing no-shows, complementing structured data analysis.
Topic Modeling (LDA with MedSpaCy Preprocessing)
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for discovering topics in text corpora (Blei et al., 2003). MedSpaCy is used for clinical concept extraction and preprocessing.
Justification: Topic modeling uncovers underlying themes and reasons for no-shows in clinical notes, supporting qualitative analysis.
This use case focuses exclusively on Supervised Learning for Patient Show/No Show Prediction. Missed medical appointments, or "no-shows," can disrupt healthcare operations and negatively impact patient outcomes. To address this, the project leverages supervised machine learning models to predict whether a patient will attend their scheduled appointment.
Algorithm and Approach:
- Supervised Learning Models: Logistic Regression, Random Forest, and XGBoost are employed to predict appointment attendance using structured data such as demographics, medical history, and appointment details.
- Feature Engineering: Relevant features are selected and engineered to improve model performance, including patient age, prior no-show history, comorbidities, and appointment lead time.
- Model Evaluation: The models are evaluated using metrics like accuracy, precision, recall, and ROC-AUC to ensure reliable predictions.
By accurately identifying patients at risk of missing appointments, this approach enables healthcare providers to implement targeted interventions, optimize scheduling, and improve overall patient care.
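The train-and-evaluate loop described above can be sketched as follows. The data here is synthetic (make_classification with an assumed 80/20 class split), and only the two scikit-learn models are shown; XGBoost is omitted so the sketch needs no extra dependency. The class_weight="balanced" setting is an illustrative choice, not a confirmed project setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for show/no-show.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    # Scaling helps the linear model converge; trees do not need it.
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```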
This use case focuses on Unsupervised Learning for Understanding Patient Profiles. Unlike supervised learning, unsupervised learning does not use labeled outcomes (such as "no-show" or "show") but instead seeks to uncover hidden patterns and groupings within the patient data.
Algorithm and Approach:
- Dimensionality Reduction (PCA): Principal Component Analysis (PCA) is applied to reduce the complexity of the dataset by transforming correlated features into a smaller set of uncorrelated components. This helps visualize the data and identify the most influential features.
- Clustering Algorithms: K-Means and Gaussian Mixture Models (GMM) are used to group patients into clusters based on similarities in their demographic, clinical, and behavioral characteristics. The optimal number of clusters is determined using the Elbow Method.
- Feature Engineering: Relevant features are selected and engineered to enhance clustering quality, such as emotional distress scores and standardized numerical variables.
- Cluster Analysis: The resulting clusters are analyzed to interpret patient profiles, identify common traits within each group, and uncover patterns that may influence appointment attendance or healthcare needs.
By segmenting patients into meaningful clusters, this approach enables healthcare providers to tailor interventions, personalize care, and better understand the diverse needs of their patient population.
- add_emotional_distress():
  - This method creates a new feature in the DataFrame that quantifies the overall emotional distress of each patient.
  - It typically sums or combines the binary emotion columns (e.g., 'anxiety', 'stress', 'confusion', etc.) into a single 'emotional_distress' score.
  - This score can be used as an additional feature for clustering and analysis.
- standardize():
  - This method standardizes (scales) the numerical features in the DataFrame so that each has zero mean and unit variance.
  - Standardization is important for clustering and PCA because it ensures that all features contribute equally, regardless of their original scale.
  - The standardized data is usually stored in an attribute like self.numeric_df.
- run_pca(n_components=None):
  - This method applies Principal Component Analysis (PCA) to the standardized numerical data.
  - PCA reduces the dimensionality of the data by transforming it into a set of orthogonal components that capture the most variance.
  - If n_components is None, all components are kept; otherwise, only the specified number of components are retained.
  - The transformed data is stored in an attribute like self.X_pca for further analysis and visualization.
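The three steps above can be sketched with scikit-learn directly. This is not the ClusteringAnalysis implementation: the synthetic demo frame and the choice to sum only the four negative emotion flags are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

EMOTIONS = ["anxiety", "stress", "confusion", "hopeful", "fear"]
# Assumption: distress is the sum of the negative emotion flags only.
NEGATIVE = ["anxiety", "stress", "confusion", "fear"]

# Synthetic binary emotion flags plus one numeric column.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 2, size=(500, 5)), columns=EMOTIONS)
demo["Age"] = rng.integers(0, 90, size=500)

# Step 1: combine the flags into a single distress score.
demo["Emotional_Distress"] = demo[NEGATIVE].sum(axis=1)

# Step 2: scale every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(demo)

# Step 3: PCA with n_components=None keeps all components.
pca = PCA(n_components=None)
X_pca = pca.fit_transform(scaled)

print("shape after PCA:", X_pca.shape)
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 4))
```

Because the distress score is an exact linear combination of columns already present, it adds no new direction of variance: the last component's explained variance is numerically zero. The score can still be convenient for interpretation, but it does not increase the effective dimensionality of the data.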
if is_step_enabled('unsupervised_clustering'):
    clustering = ClusteringAnalysis(df)
    clustering.add_emotional_distress()
    clustering.standardize()
    clustering.run_pca(n_components=None)
Dimensionality reduction simplifies high-dimensional data by transforming it into a lower-dimensional space while retaining most of the important information. Principal Component Analysis (PCA) is used to reduce the number of numerical features in the patient dataset. This helps visualize complex relationships, improves clustering performance, and reduces noise. The code standardizes features, applies PCA, and selects the minimum number of components needed to explain a target percentage (e.g., 90%) of the variance, making subsequent analysis more efficient and interpretable.
if is_step_enabled('unsupervised_clustering'):
    pca_columns = clustering.numeric_df.columns.tolist()
    for i in range(0, len(pca_columns), 3):
        cols = pca_columns[i:i+3]
        plotter.plot_pca_3d_colored_by_features(clustering.X_pca, clustering.numeric_df, cols)
The 3D PCA plots reveal how patient data clusters and separates based on individual features, highlighting distinct groupings and patterns in reduced dimensional space.
Distinct Clusters Exist: Binary features like Hypertension, Diabetes, SMS_received, and emotional indicators like Fear and Stress show clear separability, suggesting they contribute significantly to underlying patient groupings.
Emotional & Behavioral Features Are Key Drivers: Variables such as Emotional_Distress, Fear, Anxiety, and Stress form well-separated gradients or clusters, indicating their strong influence on patient behavior, including no-shows.
Demographics Offer Moderate Segmentation: Age and Gender show gradual transitions or weak clustering, useful when combined with emotional or clinical factors for richer segmentation.
No-show Patterns Align with Emotions & SMS: The No-show plot shares spatial similarity with emotional variables and SMS reception, indicating these features are predictive and clusterable.
if is_step_enabled('unsupervised_clustering'):
    target_variance = 0.90
    top_n, X_reduced = clustering.select_top_n_components(target_variance=target_variance)
    print(f"Number of components to retain ≥ {target_variance*100:.0f}% variance: {top_n}")
    explained_df = clustering.explained_variance()
    plotter.plot_pca_explained_variance(explained_df)
Number of components to retain ≥ 90% variance: 12
The first 3 principal components explain around 42–45% of the total variance, making them suitable for 3D visualization. According to the output above, 12 components are needed to reach 90% of the variance, so the 16 standardized features can be reduced to 12 dimensions with limited information loss.
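Selecting the smallest number of components that reaches a variance target amounts to a cumulative sum over the explained variance ratios. The sketch below shows one way to do it; n_components_for_variance is a hypothetical helper, not the project's select_top_n_components, and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA


def n_components_for_variance(explained_variance_ratio, target=0.90):
    """Smallest number of leading components whose cumulative ratio >= target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= target.
    return int(np.searchsorted(cumulative, target) + 1)


# Hand-checkable case: cumulative ratios are 0.50, 0.80, 0.95, 1.00.
small_case = n_components_for_variance([0.5, 0.3, 0.15, 0.05], target=0.90)

# Synthetic data with unequal column scales so variance is concentrated.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8)) * np.array([5, 4, 3, 2, 1, 1, 1, 1])

full_pca = PCA(svd_solver="full").fit(X)
n_needed = n_components_for_variance(full_pca.explained_variance_ratio_, 0.90)

# scikit-learn can make the same selection when n_components is a float in (0, 1).
auto_pca = PCA(n_components=0.90, svd_solver="full").fit(X)
print(n_needed, auto_pca.n_components_)
```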
if is_step_enabled('unsupervised_clustering'):
    feature_contributions = clustering.get_feature_contributions()
    print("Feature contributions to PCA components:")
    print(feature_contributions)
    loadings = clustering.get_loadings()
    plotter.plot_pca_biplot(clustering.X_pca, loadings, clustering.numeric_df.columns)
Feature contributions to PCA components:
PC1 PC2 PC3 PC4 \
Gender -5.837364e-03 1.196770e-01 -2.650962e-03 2.812564e-01
Age -2.922519e-02 4.251152e-01 -1.368067e-01 4.240844e-01
Scholarship -1.874359e-01 2.514545e-01 1.135061e-01 2.408097e-01
Hypertension -4.289357e-01 -1.430078e-02 6.128584e-01 4.848147e-02
Diabetes 4.684101e-01 2.633773e-02 2.858862e-01 5.194370e-02
Alcoholism -1.679672e-01 -5.768804e-02 2.836091e-01 -2.583521e-02
Handcap -2.703040e-01 -1.160922e-02 2.147822e-01 1.677240e-02
SMS_received 5.402639e-01 -2.459760e-01 9.289261e-02 6.026642e-02
No-show 2.195234e-01 -9.583449e-03 1.535400e-01 5.151367e-02
WaitDays 2.360314e-01 -4.218770e-02 5.409839e-01 5.497996e-02
anxiety -1.762699e-01 -2.666693e-01 -9.259017e-02 -1.622846e-01
stress 4.919207e-02 2.491538e-01 6.840836e-02 -5.300167e-02
confusion 1.617717e-01 5.022303e-01 1.417780e-01 4.540366e-02
hopeful -5.884194e-02 -5.009183e-01 8.553286e-02 3.387939e-01
fear -1.996698e-02 -1.965803e-01 -1.377125e-01 7.223322e-01
Emotional_Distress 3.844967e-17 6.255467e-17 1.177523e-16 -1.325946e-17
PC5 PC6 PC7 PC8 \
Gender 2.134812e-01 7.807768e-02 1.169441e-03 -4.182740e-02
Age 3.169267e-01 5.984165e-02 1.291334e-01 -2.029923e-01
Scholarship 2.024550e-01 -3.040014e-02 -1.243534e-02 5.427730e-01
Hypertension 7.266704e-02 -8.438341e-02 -1.020836e-01 -1.919493e-01
Diabetes -5.432969e-03 7.794893e-01 1.033778e-01 2.634501e-02
Alcoholism -2.129636e-02 -9.466927e-02 8.309267e-01 -7.499322e-02
Handcap -3.281956e-02 2.227482e-01 -4.592875e-01 -2.248192e-01
SMS_received 3.229407e-01 -3.771297e-01 -7.059994e-03 -1.291102e-01
No-show 1.192008e-01 -3.120903e-01 -2.157472e-01 -1.058301e-03
WaitDays 2.508276e-02 -1.151146e-01 -6.542622e-02 -4.200629e-02
anxiety 7.856117e-01 1.689152e-01 8.516008e-03 1.221660e-01
stress 1.639190e-01 -3.217790e-03 -8.208680e-02 -3.638676e-02
confusion -1.030195e-01 -1.805535e-01 -3.420521e-02 3.619571e-01
hopeful -1.329567e-01 4.009817e-02 -4.215159e-02 5.685171e-01
fear -1.211621e-01 -2.160880e-02 6.442436e-03 -2.825379e-01
Emotional_Distress 2.480239e-16 1.209223e-16 1.397364e-16 -2.336754e-16
PC9 PC10 PC11 PC12 \
Gender -2.114255e-02 -4.189637e-02 0.357575 0.356088
Age -1.088313e-01 -1.268487e-01 -0.346938 -0.259728
Scholarship 2.150536e-01 4.457632e-01 -0.029377 -0.168567
Hypertension -2.920192e-01 -4.109716e-01 0.004984 -0.123023
Diabetes 3.548543e-03 -1.084770e-01 -0.048828 0.033457
Alcoholism 3.528206e-01 3.669021e-02 0.023350 0.135824
Handcap 6.944068e-01 6.151578e-02 -0.104534 0.053002
SMS_received 3.648491e-01 -2.041364e-01 0.013887 -0.306378
No-show -5.858319e-03 1.266374e-02 -0.181105 0.477818
WaitDays -3.102966e-01 5.550358e-01 -0.047040 -0.012788
anxiety -7.545962e-02 -2.897839e-02 -0.065791 0.225610
stress 3.529641e-02 -1.114575e-02 0.731644 -0.327285
confusion 1.130039e-01 -4.033983e-01 -0.070987 0.279660
hopeful -1.837883e-02 -2.730736e-01 0.057532 -0.151360
fear -2.977789e-03 9.743676e-02 0.112411 0.152057
Emotional_Distress 4.783145e-17 -3.414115e-17 -0.375777 -0.367721
PC13 PC14 PC15 PC16
Gender 7.733185e-02 -1.895983e-01 0.471776 0.575977
Age 3.913380e-01 -1.341778e-01 0.095273 -0.248936
Scholarship -2.391789e-01 3.696282e-01 0.160397 -0.017514
Hypertension -2.075549e-01 2.322649e-01 0.108737 -0.003848
Diabetes -6.594890e-02 2.364688e-01 -0.023187 -0.019167
Alcoholism 1.547308e-01 -1.024315e-02 -0.111838 0.022410
Handcap 1.401942e-01 -2.145590e-01 -0.032525 -0.041670
SMS_received -2.219283e-01 5.233089e-03 0.230610 -0.028386
No-show 4.579666e-01 5.101516e-01 -0.182586 0.051987
WaitDays 5.375094e-02 -4.593302e-01 -0.055853 -0.056386
anxiety -1.473045e-01 -1.628967e-01 -0.295996 -0.067693
stress 2.561858e-01 7.975534e-02 -0.421051 -0.002526
confusion -2.701598e-01 -3.368572e-01 -0.278478 -0.035756
hopeful 3.813475e-01 -1.749224e-01 -0.002539 -0.045723
fear -3.535912e-01 7.201160e-02 -0.382648 -0.057622
Emotional_Distress -2.035829e-16 1.586268e-16 -0.371732 0.765107
The 3D PCA biplot shows that features like Emotional_Distress, Hypertension, Diabetes, and SMS_received have strong directional contributions to the principal components, indicating they are key drivers of variance and likely critical for clustering and predictive modeling.
Elbow Curve¶
The elbow curve is a graphical method used to determine the optimal number of clusters (k) in clustering algorithms such as K-Means. By plotting the within-cluster sum of squares (WCSS) against different values of k, the curve typically shows a sharp decrease in WCSS as k increases, followed by a point where the rate of decrease slows down and the curve bends, forming an "elbow." This "elbow" point suggests a suitable number of clusters, as adding more clusters beyond this point yields diminishing returns in reducing WCSS (Ketchen & Shook, 1996).
- elbow_method():
  - This method, defined in the ClusteringAnalysis class, calculates the within-cluster sum of squares (WCSS) for a range of cluster counts (k). It fits the clustering algorithm (typically K-Means) for each value of k in the specified range and records the WCSS. The method returns the list of WCSS values and the optimal k (the "elbow" point) where the decrease in WCSS starts to level off.
- plot_elbow_curve():
  - This plotting function visualizes the WCSS values against the number of clusters (k). It highlights the "elbow" point, helping users visually identify the optimal number of clusters for their data. The x-axis represents the number of clusters, and the y-axis shows the WCSS for each k.
These two functions are used together to select the most appropriate number of clusters for unsupervised learning tasks.
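How `elbow_method()` picks the elbow is not shown in this notebook. One common heuristic, sketched here as an assumption rather than as the project's implementation, selects the k whose point lies farthest from the straight line joining the first and last points of the WCSS curve. `find_elbow` is a hypothetical helper and the WCSS values are invented for illustration.

```python
def find_elbow(k_values, wcss):
    """Return the k whose (k, WCSS) point is farthest from the chord
    joining the first and last points of the curve."""
    x1, y1 = k_values[0], wcss[0]
    x2, y2 = k_values[-1], wcss[-1]
    best_k, best_dist = k_values[0], -1.0
    for k, w in zip(k_values, wcss):
        # Numerator of the point-to-line distance; the denominator is the
        # same for every point, so it can be dropped when comparing.
        dist = abs((y2 - y1) * k - (x2 - x1) * w + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


# Synthetic WCSS curve (illustrative values, not taken from the dataset)
k_values = list(range(1, 9))
wcss = [1000, 600, 350, 200, 180, 165, 155, 150]
print(find_elbow(k_values, wcss))  # 4
```

Because the comparison uses only the numerator of the distance, the selected k does not change if either axis is rescaled.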
if is_step_enabled('unsupervised_clustering'):
    wcss, optimal_k = clustering.elbow_method(k_range=range(1, 15))
    plotter.plot_elbow_curve(list(range(1, 15)), wcss, optimal_k)
The elbow plot shows that the optimal number of clusters is k=4: the WCSS (inertia) drops sharply up to this point and flattens out afterwards. This suggests that 4 clusters capture most of the data's structure, and that adding more clusters would mainly split existing groups without a meaningful reduction in WCSS.
K-Means clustering is an unsupervised machine learning algorithm that partitions data into k distinct, non-overlapping clusters based on feature similarity. The algorithm iteratively assigns each data point to the nearest cluster centroid and updates centroids to minimize the within-cluster sum of squares (WCSS), effectively grouping similar observations together (MacQueen, 1967).
For the patient no-show dataset, K-Means is valuable because it can uncover natural groupings among patients based on demographic, clinical, behavioral, and emotional features. By identifying clusters of patients with similar profiles (for example, those with high emotional distress, chronic conditions, or specific appointment behaviors), healthcare providers can tailor interventions, personalize communication, and optimize scheduling strategies. K-Means is computationally efficient and interpretable (Ketchen & Shook, 1996), which suits large healthcare datasets once mixed-type features have been numerically encoded and scaled.
Gaussian Mixture Model (GMM) is a probabilistic clustering algorithm that assumes the data is generated from a mixture of several Gaussian (normal) distributions, each representing a different cluster. Unlike K-Means, which assigns each data point to a single cluster, GMM estimates the probability that each point belongs to each cluster, allowing for soft (probabilistic) assignments. The model uses the Expectation-Maximization (EM) algorithm to iteratively estimate the parameters (means, covariances, and weights) of the Gaussian components that best fit the data distribution (Reynolds, 2009).
In the context of the patient no-show dataset, GMM is particularly useful because it can model clusters of varying shapes, sizes, and densities, which are common in real-world healthcare data. This flexibility allows GMM to uncover nuanced patient groupings based on demographic, clinical, and emotional features, providing deeper insights into patterns of appointment attendance and patient behavior.
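The difference between hard and soft assignment can be illustrated with a minimal one-dimensional sketch. The mixture parameters are invented for illustration, and the helper names (`hard_assignment`, `soft_assignment`) are hypothetical. The soft assignment is the posterior computed in the E-step of the EM algorithm.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def hard_assignment(x, centroids):
    """K-Means style: index of the nearest centroid."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))


def soft_assignment(x, weights, means, stds):
    """GMM E-step: posterior probability of each component given x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [p / total for p in joint]


# Hypothetical two-component mixture (parameters invented for illustration)
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

for x in (1.0, 2.0):
    print(x, hard_assignment(x, means), soft_assignment(x, weights, means, stds))
```

A point halfway between the two means gets a probability of 0.5 for each component, whereas the hard assignment has to pick one cluster.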
The evaluate_clustering_performance method in the ClusteringAnalysis class compares KMeans and GMM clustering for different values of k. For each k, it fits both models to the PCA-reduced data, calculates performance metrics (Silhouette, Davies-Bouldin, and Calinski-Harabasz scores), and saves the cluster labels. This helps identify the best clustering approach and the optimal number of clusters.
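To make the silhouette metric concrete, here is a from-scratch sketch of how it is computed. In practice a library routine such as scikit-learn's `silhouette_score` would normally be used; whether the project does so is an assumption, since the method body is not shown. The toy points are invented for illustration.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # close to 1: good separation
print(silhouette_score(points, [0, 1, 0, 1]))  # negative: points grouped with the wrong neighbours
```

Values near 1 indicate compact, well-separated clusters; values near 0 indicate overlapping clusters.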
if is_step_enabled('unsupervised_clustering'):
    k_values = range(2, 8)
    kmeans_scores, gmm_scores, kmeans_labels_list, gmm_labels_list = clustering.evaluate_clustering_performance(k_values)
    for k in k_values:
        kmeans_labels = kmeans_labels_list[k - k_values.start]
        gmm_labels = gmm_labels_list[k - k_values.start]
        plotter.plot_clustering_3d_side_by_side(clustering.X_reduced, kmeans_labels, gmm_labels, k)
- K-Means gives clean, fast segmentation; GMM captures soft, realistic overlaps.
- Optimal cluster count is k=4 (based on the elbow method), offering the best balance of separation and interpretability.
- Cluster insights (k=4):
  - Cluster 0: Older, high-risk (e.g., hypertension) → prioritize support and follow-ups.
  - Cluster 1: Younger, misses SMS → needs behavioral nudges and digital reminders.
  - Cluster 2: Emotionally stable, low-risk → ideal for automation.
  - Cluster 3: High emotional distress/alcoholism → assign care managers, structured support.
- Actionable Uses: No-show prevention, personalized outreach, better resource allocation, and feature engineering for predictive models.
if is_step_enabled('unsupervised_clustering'):
    print('KMeans Scores:')
    display(kmeans_scores)
    print('GMM Scores:')
    display(gmm_scores)
    plotter.plot_clustering_scores(kmeans_scores, gmm_scores)
KMeans Scores:
| | k | Silhouette Score | Davies-Bouldin Score | Calinski-Harabasz Score |
|---|---|---|---|---|
| 0 | 2 | 0.149469 | 2.362978 | 17040.848357 |
| 1 | 3 | 0.159410 | 1.979498 | 17031.371113 |
| 2 | 4 | 0.156826 | 1.926502 | 15220.871132 |
| 3 | 5 | 0.148039 | 2.144950 | 12514.393745 |
| 4 | 6 | 0.143271 | 2.133657 | 11918.065318 |
| 5 | 7 | 0.151726 | 2.053565 | 12000.052416 |
GMM Scores:
| | k | Silhouette Score | Davies-Bouldin Score | Calinski-Harabasz Score |
|---|---|---|---|---|
| 0 | 2 | 0.123373 | 2.854364 | 11978.682032 |
| 1 | 3 | 0.136138 | 2.081934 | 15652.588788 |
| 2 | 4 | 0.108175 | 2.442411 | 10218.258226 |
| 3 | 5 | 0.131484 | 2.291962 | 10885.967972 |
| 4 | 6 | 0.146783 | 2.028766 | 11463.829010 |
| 5 | 7 | 0.129871 | 2.090993 | 11068.446895 |
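- KMeans outperforms GMM on most metrics and cluster counts (Silhouette, Davies-Bouldin, Calinski-Harabasz). The exception is k = 6, where GMM has a slightly higher Silhouette score (0.1468 vs. 0.1433) and a lower Davies-Bouldin score (2.029 vs. 2.134).
- For KMeans, k = 3 gives the highest Silhouette score (0.1594) and k = 4 the lowest Davies-Bouldin score (1.927); the Calinski-Harabasz score peaks at k = 2 and is nearly unchanged at k = 3. Overall, k = 3 offers the best combination of compact, well-separated clusters, with k = 4 close behind.
- GMM performs comparably at k = 6, but its best scores on every metric remain below the best KMeans scores.
- KMeans is preferred for this dataset due to its better-defined clusters and higher interpretability. All Silhouette scores are low (roughly 0.11 to 0.16), so the clusters overlap substantially regardless of the method.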
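The comparison above can be checked directly against the two score tables. The sketch below copies the table values into plain dictionaries and reports the best k per metric; `best_k` is a hypothetical helper, not part of the project code. Silhouette and Calinski-Harabasz are maximised, while Davies-Bouldin is minimised.

```python
# Scores copied from the KMeans and GMM tables above:
# k -> (silhouette, davies_bouldin, calinski_harabasz)
kmeans = {
    2: (0.149469, 2.362978, 17040.848357),
    3: (0.159410, 1.979498, 17031.371113),
    4: (0.156826, 1.926502, 15220.871132),
    5: (0.148039, 2.144950, 12514.393745),
    6: (0.143271, 2.133657, 11918.065318),
    7: (0.151726, 2.053565, 12000.052416),
}
gmm = {
    2: (0.123373, 2.854364, 11978.682032),
    3: (0.136138, 2.081934, 15652.588788),
    4: (0.108175, 2.442411, 10218.258226),
    5: (0.131484, 2.291962, 10885.967972),
    6: (0.146783, 2.028766, 11463.829010),
    7: (0.129871, 2.090993, 11068.446895),
}


def best_k(scores):
    """Best k per metric (higher is better except for Davies-Bouldin)."""
    return {
        "silhouette": max(scores, key=lambda k: scores[k][0]),
        "davies_bouldin": min(scores, key=lambda k: scores[k][1]),
        "calinski_harabasz": max(scores, key=lambda k: scores[k][2]),
    }


print("KMeans best k:", best_k(kmeans))
print("GMM best k:", best_k(gmm))

# Cluster counts at which GMM has the higher Silhouette score
gmm_wins = [k for k in kmeans if gmm[k][0] > kmeans[k][0]]
print("GMM beats KMeans on Silhouette at k =", gmm_wins)
```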
Based on PCA projections, explained variance, biplots, elbow method, clustering visualizations (KMeans & GMM), and cluster evaluation metrics, the following conclusions are drawn:
- Dimensionality Insights:
  - PCA effectively reduced high-dimensional data to 3 components, retaining roughly 45% of the variance and revealing clear feature-driven structure.
  - Features like Emotional_Distress, SMS_received, Hypertension, and Diabetes strongly influence variance and cluster formation.
- Optimal Clustering Strategy:
  - The elbow method identifies k = 4 as the optimal cluster count.
  - Clustering visualizations show that KMeans at k = 4 provides clean, interpretable groupings, while GMM adds flexibility for overlapping cases.
- Evaluation Metrics Favor KMeans:
  - KMeans outperforms GMM on most evaluation metrics (Silhouette, Davies-Bouldin, Calinski-Harabasz), with its highest Silhouette score at k = 3 and its lowest Davies-Bouldin score at k = 4. The exception is k = 6, where GMM is slightly better on Silhouette and Davies-Bouldin.
  - GMM provides soft assignments but does not outperform KMeans overall in this context.
- Actionable Patient Segments Identified:
  - Clusters reflect meaningful patient segments:
    - High-risk patients (e.g., older, hypertensive, emotionally distressed)
    - Digitally disengaged (e.g., missed SMS, younger)
    - Low-risk, emotionally stable groups
    - Emotionally or behaviorally vulnerable (e.g., alcohol-related, high fear/confusion)
  - These groups can inform targeted interventions to reduce no-shows, improve engagement, and optimize healthcare resource allocation.
Use KMeans with k = 4 for segmentation, supported by PCA-reduced features. This approach offers the best balance of interpretability, performance, and actionable insight for improving patient appointment outcomes.
if is_step_enabled('nlp_sentiment_analysis'):
    # First, let's check what columns are available in our DataFrame
    print("Available columns:", df.columns.tolist())
    print("DataFrame shape:", df.shape)
    # Check if emotion columns exist, if not create them from PatientSentiment text
    emotion_columns_exist = all(col in df.columns for col in EMOTION_STATES)
    print(f"Emotion columns exist: {emotion_columns_exist}")
    if not emotion_columns_exist:
        print("Creating emotion columns from PatientSentiment text...")
        # Create emotion columns by checking if emotion words appear in PatientSentiment
        for emotion in EMOTION_STATES:
            df[emotion] = df['PatientSentiment'].str.lower().str.contains(emotion, na=False).astype(int)
        print("Emotion columns created successfully!")
    # Now create our sentiment analysis DataFrame
    features = ['PatientSentiment', 'No-show'] + EMOTION_STATES
    available_features = [col for col in features if col in df.columns]
    print(f"Using features: {available_features}")
    sa_df = df[available_features].dropna()
    print("Sentiment Analysis DataFrame shape:", sa_df.shape)
    print("Sentiment Analysis DataFrame columns:", sa_df.columns.tolist())
    print("Sample emotion distribution:")
    for emotion in EMOTION_STATES:
        if emotion in sa_df.columns:
            print(f" {emotion}: {sa_df[emotion].sum()} positive cases out of {len(sa_df)}")
Available columns: ['Gender', 'ScheduledDay', 'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hypertension', 'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show', 'PatientNotes', 'PatientSentiment', 'NoShowReason', 'WaitDays', 'anxiety', 'stress', 'confusion', 'hopeful', 'fear']
DataFrame shape: (110527, 21)
Emotion columns exist: True
Using features: ['PatientSentiment', 'No-show', 'anxiety', 'stress', 'confusion', 'hopeful', 'fear']
Sentiment Analysis DataFrame shape: (110527, 7)
Sentiment Analysis DataFrame columns: ['PatientSentiment', 'No-show', 'anxiety', 'stress', 'confusion', 'hopeful', 'fear']
Sample emotion distribution:
 anxiety: 54377 positive cases out of 110527
 stress: 66678 positive cases out of 110527
 confusion: 58866 positive cases out of 110527
 hopeful: 11753 positive cases out of 110527
 fear: 47130 positive cases out of 110527
if is_step_enabled('nlp_sentiment_analysis'):
    # Initialize the improved sentiment analysis model
    print("\nInitializing improved sentiment analysis model...")
    sa_model = SentimentAnalysisModel(sa_df, emotional_states=EMOTION_STATES, device=NLP_CONFIG['device'])
    # Train the model with improved anti-overfitting techniques
    print("Training model with improved regularization...")
    sa_model.train(epochs=5, patience=3)  # Up to 5 epochs, with early stopping after 3 epochs without improvement
Initializing improved sentiment analysis model...
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training model with improved regularization...
Epoch 1/5
Training: 100%|██████████| 4698/4698 [05:36<00:00, 13.97it/s, loss=0.366]
Validating: 100%|██████████| 415/415 [00:15<00:00, 25.98it/s]
Epoch 1: Train Loss: 0.5625, Val Loss: 0.3612
Epoch 2/5
Training: 100%|██████████| 4698/4698 [04:03<00:00, 19.31it/s, loss=0.317]
Validating: 100%|██████████| 415/415 [00:15<00:00, 26.39it/s]
Epoch 2: Train Loss: 0.3471, Val Loss: 0.2367
Epoch 3/5
Training: 100%|██████████| 4698/4698 [04:13<00:00, 18.53it/s, loss=0.319]
Validating: 100%|██████████| 415/415 [00:14<00:00, 28.55it/s]
Epoch 3: Train Loss: 0.2972, Val Loss: 0.1981
Epoch 4/5
Training: 100%|██████████| 4698/4698 [05:07<00:00, 15.30it/s, loss=0.235]
Validating: 100%|██████████| 415/415 [00:16<00:00, 25.45it/s]
Epoch 4: Train Loss: 0.2821, Val Loss: 0.1827
Epoch 5/5
Training: 100%|██████████| 4698/4698 [04:23<00:00, 17.83it/s, loss=0.267]
Validating: 100%|██████████| 415/415 [00:15<00:00, 26.77it/s]
Epoch 5: Train Loss: 0.2769, Val Loss: 0.1781
Training completed in 1481.78 seconds
if is_step_enabled('nlp_sentiment_analysis'):
    # Evaluate the model with threshold tuning
    print("Evaluating model with optimized thresholds...")
    predictions, actual_labels = sa_model.evaluate()
    # Get metrics
    sentiment_analysis_metrics = sa_model.report(predictions, actual_labels)
    print("Training completed successfully!")
Evaluating model with optimized thresholds...
Evaluating: 100%|██████████| 691/691 [00:26<00:00, 26.35it/s]
Optimal threshold for anxiety: 0.55, F1: 0.962
Optimal threshold for stress: 0.45, F1: 0.960
Optimal threshold for confusion: 0.45, F1: 0.955
Optimal threshold for hopeful: 0.20, F1: 0.934
Optimal threshold for fear: 0.50, F1: 0.955
Training completed successfully!
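The per-emotion thresholds reported above come from `sa_model.evaluate()`, whose implementation is not shown. A minimal sketch of one way to tune such a threshold is given below: sweep a grid of cut-offs for a single label and keep the one with the highest F1. The grid, the helper names (`f1_score`, `best_threshold`), and the probabilities are assumptions made for illustration.

```python
def f1_score(y_true, y_pred):
    """Binary F1 computed from true positives, false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, probs, grid=None):
    """Return (threshold, F1) for the grid value that maximises F1.
    Ties are resolved in favour of the lowest threshold."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Invented probabilities for one emotion label
y_true = [0, 0, 1, 1, 1, 1]
probs = [0.12, 0.31, 0.38, 0.62, 0.81, 0.93]
print(best_threshold(y_true, probs))  # (0.35, 1.0)
```

Thresholds chosen on the same data they are scored on tend to look optimistic, so a common practice is to tune them on the validation split and report metrics on the test split.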
if is_step_enabled('nlp_sentiment_analysis'):
    # Print metrics in a readable format
    plotter.print_sentiment_metrics(sentiment_analysis_metrics)
    # Plot accuracy by emotion with overall accuracy line
    plotter.plot_accuracy_by_emotion(sentiment_analysis_metrics)
    # Plot confusion matrices for each emotion
    plotter.plot_confusion_matrices(actual_labels, predictions, sa_model.emotional_states)
    # Plot training and validation loss
    sa_stats = sa_model.get_training_stats()
    plotter.plot_training_validation_loss(sa_stats['training_losses'], sa_stats['validation_losses'])
    # Plot time taken per epoch
    plotter.plot_epoch_times(sa_stats['epoch_times'])
Model Accuracy by Emotion:
anxiety: 0.9642
stress: 0.9536
confusion: 0.9533
hopeful: 0.9870
fear: 0.9633
Overall Accuracy: 0.9643
Classification Reports:
Anxiety:
Not Present: {'precision': 0.9339512358049432, 'recall': 1.0, 'f1-score': 0.9658477613229135, 'support': 11185.0}
Present: {'precision': 1.0, 'recall': 0.9275707352806519, 'f1-score': 0.9624245879055627, 'support': 10921.0}
macro avg: {'precision': 0.9669756179024716, 'recall': 0.963785367640326, 'f1-score': 0.9641361746142381, 'support': 22106.0}
weighted avg: {'precision': 0.9665812255712607, 'recall': 0.9642178594046865, 'f1-score': 0.9641566151684355, 'support': 22106.0}
Stress:
Not Present: {'precision': 0.8970512157268494, 'recall': 0.9964371911274567, 'f1-score': 0.9441359032995753, 'support': 8701.0}
Present: {'precision': 0.9975082388875492, 'recall': 0.9257739649384558, 'f1-score': 0.9603033351388996, 'support': 13405.0}
macro avg: {'precision': 0.9472797273071993, 'recall': 0.9611055780329563, 'f1-score': 0.9522196192192375, 'support': 22106.0}
weighted avg: {'precision': 0.9579679982957982, 'recall': 0.9535872613770017, 'f1-score': 0.9539397766283613, 'support': 22106.0}
Confusion:
Not Present: {'precision': 0.9293095564700503, 'recall': 0.9750527729802341, 'f1-score': 0.9516317834901906, 'support': 10422.0}
Present: {'precision': 0.9767254498254409, 'recall': 0.9338411502909962, 'f1-score': 0.9548020126886896, 'support': 11684.0}
macro avg: {'precision': 0.9530175031477456, 'recall': 0.9544469616356152, 'f1-score': 0.9532168980894401, 'support': 22106.0}
weighted avg: {'precision': 0.9543709559979787, 'recall': 0.9532706052655388, 'f1-score': 0.95330739002033, 'support': 22106.0}
Hopeful:
Not Present: {'precision': 0.9856964864191378, 'recall': 1.0, 'f1-score': 0.9927967271540797, 'support': 19778.0}
Present: {'precision': 1.0, 'recall': 0.8767182130584192, 'f1-score': 0.9343099107347219, 'support': 2328.0}
macro avg: {'precision': 0.9928482432095689, 'recall': 0.9383591065292096, 'f1-score': 0.9635533189444008, 'support': 22106.0}
weighted avg: {'precision': 0.9872028005246407, 'recall': 0.987017099430019, 'f1-score': 0.9866374351689053, 'support': 22106.0}
Fear:
Not Present: {'precision': 0.9413295657346817, 'recall': 0.9981864059296641, 'f1-score': 0.9689246077305779, 'support': 12682.0}
Present: {'precision': 0.9973434973434974, 'recall': 0.9162775891341256, 'f1-score': 0.9550934631124876, 'support': 9424.0}
macro avg: {'precision': 0.9693365315390896, 'recall': 0.9572319975318948, 'f1-score': 0.9620090354215327, 'support': 22106.0}
weighted avg: {'precision': 0.965208842468667, 'recall': 0.9632678910702976, 'f1-score': 0.9630282580119096, 'support': 22106.0}
if is_step_enabled('nlp_sentiment_analysis'):
    # Prepare data splits for hyperparameter tuning
    X = df['PatientSentiment'].values
    y = df[EMOTION_STATES].values
    # Note: the random seed is read from NLP_CONFIG['epochs'], so changing the epoch setting also changes the split
    X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=NLP_CONFIG['epochs'])
    X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.15, random_state=NLP_CONFIG['epochs'])
    # Run hyperparameter tuning using the class method
    results = SentimentAnalysisModel.run_hyperparameter_tuning(
        X_train, y_train, X_val, y_val, X_test, y_test,
        emotional_states=EMOTION_STATES,
        device=NLP_CONFIG['device'],
        tokenizer=sa_model.tokenizer,
        max_seq_length=NLP_CONFIG['max_length']
    )
--- Hyperparameter Configuration 1/2 ---
Learning Rate: 5e-05
Batch Size: 16
Max Epochs: 2
Early Stopping Patience: 1
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training Epoch 1: 100%|██████████| 4698/4698 [05:08<00:00, 15.25it/s, loss=0.213]
Validating Epoch 1: 100%|██████████| 415/415 [00:18<00:00, 22.05it/s]
Epoch 1: Train Loss: 0.3582, Val Loss: 0.0007
Training Epoch 2: 100%|██████████| 4698/4698 [05:33<00:00, 14.10it/s, loss=0.354]
Validating Epoch 2: 100%|██████████| 415/415 [00:16<00:00, 25.50it/s]
Epoch 2: Train Loss: 0.2659, Val Loss: 0.0007
Early stopping counter: 1/1
Early stopping triggered after 2 epochs
Evaluating: 100%|██████████| 691/691 [00:26<00:00, 25.72it/s]
Overall Accuracy: 0.9614
Training Time: 676.38 seconds
--- Hyperparameter Configuration 2/2 ---
Learning Rate: 0.0001
Batch Size: 16
Max Epochs: 2
Early Stopping Patience: 1
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training Epoch 1: 100%|██████████| 4698/4698 [05:41<00:00, 13.74it/s, loss=0.188]
Validating Epoch 1: 100%|██████████| 415/415 [00:14<00:00, 27.95it/s]
Epoch 1: Train Loss: 0.2736, Val Loss: 0.0005
Training Epoch 2: 100%|██████████| 4698/4698 [05:17<00:00, 14.81it/s, loss=0.184]
Validating Epoch 2: 100%|██████████| 415/415 [00:14<00:00, 29.04it/s]
Epoch 2: Train Loss: 0.1837, Val Loss: 0.0005
Early stopping counter: 1/1
Early stopping triggered after 2 epochs
Evaluating: 100%|██████████| 691/691 [00:23<00:00, 28.85it/s]
Overall Accuracy: 0.9627
Training Time: 688.20 seconds
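The progress-bar counts in the logs (4698 training, 415 validation, and 691 evaluation batches) can be reproduced from the split fractions. The sketch below assumes that `train_test_split` rounds a fractional test size up, as scikit-learn does, and that the training batch size is 16 as printed in the configuration. The evaluation batch size of 32 is not stated anywhere; it is inferred here because it matches the logged counts.

```python
import math


def split_sizes(n_samples, test_size, val_size):
    """Sizes produced by two successive splits with fractional sizes."""
    n_test = math.ceil(test_size * n_samples)   # the test share is rounded up
    n_train_val = n_samples - n_test
    n_val = math.ceil(val_size * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


def n_batches(n_rows, batch_size):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(n_rows / batch_size)


n_train, n_val, n_test = split_sizes(110527, test_size=0.2, val_size=0.15)
print(n_train, n_val, n_test)        # 75157 13264 22106
print(n_batches(n_train, 16))        # 4698 training batches
print(n_batches(n_val, 32))          # 415 validation batches (batch size 32 is an inference)
print(n_batches(n_test, 32))         # 691 evaluation batches
```

The test size of 22106 also matches the `support` totals in the classification reports.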
if is_step_enabled('nlp_sentiment_analysis'):
    # Print and plot metrics for each configuration
    for i, res in enumerate(results):
        print(f"\n--- Results for Hyperparameter Configuration {i+1} ---")
        # Compute metrics for each configuration
        emotion_accuracies = {emo: accuracy_score(res['actual_labels'][:, idx], res['predictions'][:, idx]) for idx, emo in enumerate(EMOTION_STATES)}
        sentiment_analysis_metrics = {
            'emotion_accuracies': emotion_accuracies,
            'overall_accuracy': res['accuracy'],
            'classification_reports': {}  # Optionally fill with classification_report if needed
        }
        plotter.print_sentiment_metrics(sentiment_analysis_metrics)
        plotter.plot_accuracy_by_emotion(sentiment_analysis_metrics)
        plotter.plot_confusion_matrices(res['actual_labels'], res['predictions'], EMOTION_STATES)
        plotter.plot_training_validation_loss(res['train_losses'], res['val_losses'])
        plotter.plot_epoch_times(res['epoch_times'])
--- Results for Hyperparameter Configuration 1 ---
Model Accuracy by Emotion:
anxiety: 0.9627
stress: 0.9488
confusion: 0.9525
hopeful: 0.9862
fear: 0.9571
Overall Accuracy: 0.9614
Classification Reports:

--- Results for Hyperparameter Configuration 2 ---
Model Accuracy by Emotion:
anxiety: 0.9626
stress: 0.9499
confusion: 0.9526
hopeful: 0.9877
fear: 0.9607
Overall Accuracy: 0.9627
Classification Reports:
if is_step_enabled('nlp_sentiment_analysis'):
    # Select the best model based on accuracy and training time using the class method
    best_model, best_params, best_idx, combined_scores = SentimentAnalysisModel.get_best_model_from_results(results)
    print(f"\nBest model configuration (balanced for both accuracy and speed):")
    print(f"Learning Rate: {best_params['learning_rate']}")
    print(f"Batch Size: {best_params['batch_size']}")
    print(f"Epochs: {best_params['epochs']}")
    print(f"Accuracy: {results[best_idx]['accuracy']:.4f}")
    print(f"Training Time: {results[best_idx].get('training_time', sum(results[best_idx]['epoch_times'])):.2f} seconds")
    print(f"Combined Score: {combined_scores[best_idx]:.4f}")
    # Plot ROC and AUC for each emotion using the class-based plotter
    plotter.plot_roc_auc_by_emotion(actual_labels, predictions, EMOTION_STATES)
Best model configuration (balanced for both accuracy and speed):
Learning Rate: 5e-05
Batch Size: 16
Epochs: 2
Accuracy: 0.9614
Training Time: 676.38 seconds
Combined Score: 0.9730
if is_step_enabled('nlp_sentiment_analysis'):
    # Export the best model and tokenizer after hyperparameter tuning
    SentimentAnalysisModel.export_best_model(
        best_model,
        sa_model.tokenizer,
        SENTIMENT_MODEL_EXPORT_PATH_RAW
    )
Best model and tokenizer exported to: d:\Personal\AI-Admissions\Semester 3\AAI-510 - Machine learning Fundamentals and Applications\Final Team Project\aai510_3proj\models\nlp\sentiment_analysis_raw
if is_step_enabled('nlp_sentiment_analysis'):
    example_text = "Patient (minor) is anxious and fearful about medical procedures, sometimes confused by instructions, and stressed by separation from family."
    expected = {'anxiety': True, 'stress': True, 'confusion': True, 'hopeful': False, 'fear': True}
    raw_pred = SentimentAnalysisModel.predict_emotions_raw(
        example_text,
        sa_model.model,
        sa_model.tokenizer,
        NLP_CONFIG['device']
    )
    print("Example text:")
    print(example_text)
    print("\nEmotion prediction comparison:")
    for emo in expected:
        result = "✅" if raw_pred[emo] == expected[emo] else "❌"
        print(f"{emo}: expected={expected[emo]}, predicted={raw_pred[emo]} {result}")
Example text:
Patient (minor) is anxious and fearful about medical procedures, sometimes confused by instructions, and stressed by separation from family.
Emotion prediction comparison:
anxiety: expected=True, predicted=False ❌
stress: expected=True, predicted=False ❌
confusion: expected=True, predicted=False ❌
hopeful: expected=False, predicted=False ✅
fear: expected=True, predicted=False ❌
if is_step_enabled('nlp_sentiment_analysis'):
    # Run the raw model test
    !pytest -s ../tests/test_sentiment_anlaysis.py -k test_sentiment_model_predictions_raw --maxfail=1 --disable-warnings -q
--- Running test_sentiment_model_predictions_raw ---
❌ Test FAILED for: Patient is hopeful and shows no significant anxiety, stress, or fear related to health conditions.
Prediction: {'anxiety': 1, 'stress': 1, 'confusion': 0, 'hopeful': 0, 'fear': 1}
Expected: ['hopeful']
F
================================== FAILURES ===================================
____________________ test_sentiment_model_predictions_raw _____________________
def test_sentiment_model_predictions_raw():
print("\n--- Running test_sentiment_model_predictions_raw ---")
model, tokenizer, _ = _load_model_and_tokenizer(SENTIMENT_MODEL_EXPORT_PATH_RAW)
model.model.eval()
results = []
for text in TEST_TEXTS:
encoding = tokenizer.encode_plus(
text,
add_special_tokens=True,
max_length=NLP_CONFIG['max_length'],
return_token_type_ids=False,
padding='max_length',
truncation=True,
return_attention_mask=True,
return_tensors='pt',
)
input_ids = encoding['input_ids'].to(NLP_CONFIG['device'])
attention_mask = encoding['attention_mask'].to(NLP_CONFIG['device'])
with torch.no_grad():
outputs = model.model(input_ids=input_ids, attention_mask=attention_mask)
logits = outputs.logits
probs = torch.sigmoid(logits).cpu().numpy()[0]
preds = (probs >= 0.5).astype(int)
results.append(preds)
for pred in results:
assert len(pred) == len(EMOTION_STATES)
assert all((p == 0 or p == 1) for p in pred)
> passed, total = _print_and_score_results(results, TEST_TEXTS, EXPECTED_EMOTIONS)
..\tests\test_sentiment_anlaysis.py:89:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
results = [array([1, 1, 0, 0, 1]), array([1, 1, 0, 0, 1]), array([0, 1, 1, 0, 1]), array([0, 0, 0, 0, 0]), array([1, 1, 0, 0, 1]), array([0, 0, 0, 0, 0]), ...]
test_texts = ['Patient is hopeful and shows no significant anxiety, stress, or fear related to health conditions.', 'Patient expres... or fear during the appointment.', 'Patient is confused about the medication schedule and expresses frustration.', ...]
expected_emotions = [['hopeful'], ['fear', 'anxiety'], ['fear', 'confusion', 'stress'], ['anxiety', 'fear', 'confusion', 'stress'], [], ['confusion'], ...]
def _print_and_score_results(results, test_texts, expected_emotions):
total = len(test_texts)
passed = 0
for idx, (text, pred) in enumerate(zip(test_texts, results)):
pred_dict = {emo: int(val) for emo, val in zip(EMOTION_STATES, pred)}
test_passed = True
for emo in expected_emotions[idx]:
if pred_dict[emo] != 1:
test_passed = False
print(f"\u274C Test FAILED for: {text}\nPrediction: {pred_dict}\nExpected: {expected_emotions[idx]}\n")
> assert False, f"Expected emotion '{emo}' to be present in: {text} (got {pred_dict})"
E AssertionError: Expected emotion 'hopeful' to be present in: Patient is hopeful and shows no significant anxiety, stress, or fear related to health conditions. (got {'anxiety': 1, 'stress': 1, 'confusion': 0, 'hopeful': 0, 'fear': 1})
E assert False
..\tests\test_sentiment_anlaysis.py:56: AssertionError
=========================== short test summary info ===========================
FAILED ..\tests\test_sentiment_anlaysis.py::test_sentiment_model_predictions_raw - AssertionError: Expected emotion 'hopeful' to be present in: Patient is hop...
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!
1 failed, 1 deselected in 35.87s
if is_step_enabled('nlp_sentiment_analysis'):
    example_text = "Patient (minor) is anxious and fearful about medical procedures, sometimes confused by instructions, and stressed by separation from family."
    expected = {'anxiety': True, 'stress': True, 'confusion': True, 'hopeful': False, 'fear': True}
    post_processed = SentimentAnalysisModel.predict_emotions(
        example_text,
        sa_model.model,
        sa_model.tokenizer,
        NLP_CONFIG['device'],
        emotion_variations_path=EMOTION_VARIATIONS_PATH,
        negation_patterns_path=NEGATION_PATTERNS_PATH
    )
    print("Post-processed emotion prediction:", post_processed)
    print("Example text:")
    print(example_text)
    print("\nEmotion prediction comparison:")
    for emo in expected:
        result = "✅" if post_processed[emo] == expected[emo] else "❌"
        print(f"{emo}: expected={expected[emo]}, predicted={post_processed[emo]} {result}")
Post-processed emotion prediction: {'anxiety': True, 'stress': True, 'confusion': True, 'hopeful': False, 'fear': True}
Example text:
Patient (minor) is anxious and fearful about medical procedures, sometimes confused by instructions, and stressed by separation from family.
Emotion prediction comparison:
anxiety: expected=True, predicted=True ✅
stress: expected=True, predicted=True ✅
confusion: expected=True, predicted=True ✅
hopeful: expected=False, predicted=False ✅
fear: expected=True, predicted=True ✅
if is_step_enabled('nlp_sentiment_analysis'):
    # Evaluate the model with post-processing on the test set
    results_post = SentimentAnalysisModel.evaluate_model_with_post_processing(
        sa_model.model,
        sa_model.test_loader,
        sa_model.tokenizer,
        NLP_CONFIG['device'],
        emotion_variations_path=EMOTION_VARIATIONS_PATH,
        negation_patterns_path=NEGATION_PATTERNS_PATH
    )
    print("\nPost-processing overall accuracy:", results_post['accuracy'])
    print("Emotion-wise accuracies:", results_post['emotion_accuracies'])
Evaluating with Post-Processing: 100%|██████████| 691/691 [03:08<00:00, 3.67it/s]
Post-processing overall accuracy: 0.669266262553153
Emotion-wise accuracies: {'anxiety': 0.6084320998823849, 'stress': 0.6616303266081607, 'confusion': 0.9215145209445399, 'hopeful': 0.46001085678096443, 'fear': 0.694743508549715}
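The contents of the emotion-variation and negation-pattern files are not shown, so the following is only a sketch of what rule-based post-processing of this kind can look like, not the project's implementation. The word lists, the negation cues, the clause-splitting rule, and the `post_process` helper are all hypothetical. A keyword that appears after a negation cue in the same clause switches the emotion off, a keyword without a preceding cue switches it on, and emotions with no keyword keep the model's prediction.

```python
import re

# Hypothetical word lists; the real ones live in the project's config files
EMOTION_VARIATIONS = {
    "anxiety": ["anxiety", "anxious"],
    "stress": ["stress", "stressed"],
    "confusion": ["confusion", "confused"],
    "hopeful": ["hopeful", "hope"],
    "fear": ["fear", "fearful"],
}
NEGATION_CUES = re.compile(r"\b(no|not|neither|nor|without|denies)\b")
CLAUSE_BREAK = re.compile(r"\bbut\b|[.;]")  # a negation is assumed to end at these


def post_process(text, model_pred):
    """Override model predictions with keyword and negation rules."""
    result = dict(model_pred)
    for clause in CLAUSE_BREAK.split(text.lower()):
        cue = NEGATION_CUES.search(clause)
        for emotion, variants in EMOTION_VARIATIONS.items():
            for word in variants:
                hit = re.search(rf"\b{re.escape(word)}\b", clause)
                if hit:
                    # Negated only if a cue occurs earlier in the same clause
                    negated = cue is not None and cue.start() < hit.start()
                    result[emotion] = not negated
    return result


# Raw prediction taken from the failing raw-model test shown earlier
raw = {"anxiety": True, "stress": True, "confusion": False, "hopeful": False, "fear": True}
text = ("Patient is hopeful and shows no significant anxiety, stress, "
        "or fear related to health conditions.")
print(post_process(text, raw))
```

Rules like these help on short, explicit sentences but can override correct model outputs on longer notes, which is one possible reason for a lower test-set accuracy after post-processing.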
if is_step_enabled('nlp_sentiment_analysis'):
    # Export the optimized model and tokenizer with post-processor config
    os.makedirs(SENTIMENT_MODEL_EXPORT_PATH_OPTIMIZED, exist_ok=True)
    shutil.copy(EMOTION_VARIATIONS_PATH, os.path.join(SENTIMENT_MODEL_EXPORT_PATH_OPTIMIZED, os.path.basename(EMOTION_VARIATIONS_PATH)))
    shutil.copy(NEGATION_PATTERNS_PATH, os.path.join(SENTIMENT_MODEL_EXPORT_PATH_OPTIMIZED, os.path.basename(NEGATION_PATTERNS_PATH)))
    SentimentAnalysisModel.export_best_model(
        best_model,
        sa_model.tokenizer,
        SENTIMENT_MODEL_EXPORT_PATH_OPTIMIZED
    )
    print(f"Optimized model and post-processor config exported to: {SENTIMENT_MODEL_EXPORT_PATH_OPTIMIZED}")
Best model and tokenizer exported to: d:\Personal\AI-Admissions\Semester 3\AAI-510 - Machine learning Fundamentals and Applications\Final Team Project\aai510_3proj\models\nlp\sentiment_analysis_optimized
Optimized model and post-processor config exported to: d:\Personal\AI-Admissions\Semester 3\AAI-510 - Machine learning Fundamentals and Applications\Final Team Project\aai510_3proj\models\nlp\sentiment_analysis_optimized
if is_step_enabled('nlp_sentiment_analysis'):
    # Run the optimized model test
    !pytest -s ../tests/test_sentiment_anlaysis.py -k test_sentiment_model_predictions_optimized --maxfail=1 --disable-warnings -q
--- Running test_sentiment_model_predictions_optimized ---
✅ Test PASSED for: Patient is hopeful and shows no significant anxiety, stress, or fear related to health conditions.
Prediction: {'anxiety': 0, 'stress': 0, 'confusion': 0, 'hopeful': 1, 'fear': 0}
Expected: ['hopeful']
✅ Test PASSED for: Patient expresses fear and anxiety about high blood pressure and possible complications.
Prediction: {'anxiety': 1, 'stress': 0, 'confusion': 0, 'hopeful': 0, 'fear': 1}
Expected: ['fear', 'anxiety']
✅ Test PASSED for: Elderly patient expresses fear of declining health, confusion about medications, and stress related to mobility issues.
Prediction: {'anxiety': 1, 'stress': 1, 'confusion': 1, 'hopeful': 0, 'fear': 1}
Expected: ['fear', 'confusion', 'stress']
✅ Test PASSED for: Patient (minor) is anxious and fearful about medical procedures, sometimes confused by instructions, and stressed by separation from family.
Prediction: {'anxiety': 1, 'stress': 1, 'confusion': 1, 'hopeful': 0, 'fear': 1}
Expected: ['anxiety', 'fear', 'confusion', 'stress']
✅ Test PASSED for: Patient is calm and shows no signs of stress, anxiety, or fear during the appointment.
Prediction: {'anxiety': 1, 'stress': 1, 'confusion': 0, 'hopeful': 0, 'fear': 1}
Expected: []
✅ Test PASSED for: Patient is confused about the medication schedule and expresses frustration.
Prediction: {'anxiety': 0, 'stress': 0, 'confusion': 1, 'hopeful': 0, 'fear': 0}
Expected: ['confusion']
✅ Test PASSED for: Patient is hopeful about recovery but still experiences occasional stress.
Prediction: {'anxiety': 0, 'stress': 1, 'confusion': 0, 'hopeful': 1, 'fear': 0}
Expected: ['hopeful', 'stress']
✅ Test PASSED for: Patient is fearful of surgery and anxious about the outcome.
Prediction: {'anxiety': 1, 'stress': 0, 'confusion': 0, 'hopeful': 0, 'fear': 1}
Expected: ['fear', 'anxiety']
✅ Test PASSED for: Patient expresses both hope and anxiety regarding the new treatment plan.
Prediction: {'anxiety': 1, 'stress': 0, 'confusion': 0, 'hopeful': 1, 'fear': 0}
Expected: ['hopeful', 'anxiety']
✅ Test PASSED for: Patient is neither anxious nor fearful, but is confused by the instructions.
Prediction: {'anxiety': 1, 'stress': 0, 'confusion': 1, 'hopeful': 0, 'fear': 1}
Expected: ['confusion']
Test score (optimized): 10/10 passed.
.
1 passed, 1 deselected in 8.63s
# Topic Modeling for Diabetes, Hypertension, Alcoholism using ClinicalTopicModel class and project architecture
if is_step_enabled('nlp_topic_modeling'):
    model = ClinicalTopicModel(config)
    conditions = ['diabetes', 'hypertension', 'alcohol']
    perplexities = []
    silhouette_scores = []
    all_topics = []
    for cond in conditions:
        df_cond = model.preprocess_notes(df, cond)
        if df_cond.empty:
            print(f"Skipping {cond}: No clinical concepts found after MedSpaCy extraction.")
            perplexities.append(None)
            silhouette_scores.append(None)
            all_topics.append([])
            continue
        model.train(df_cond['PatientNotes_clean'])
        perplexity, sil_score = model.evaluate(df_cond['PatientNotes_clean'])
        print(f"\n--- {cond.title()} ---")
        print(f"Model Perplexity: {perplexity:.2f}")
        if sil_score is not None:
            print(f"Silhouette Score: {sil_score:.2f}")
        topics = model.get_topics(n_top_words=10)
        for idx, topic_words in enumerate(topics):
            print(f"Topic {idx+1}: {' '.join(topic_words)}")
        perplexities.append(perplexity)
        silhouette_scores.append(sil_score if sil_score is not None else 0)
        all_topics.append(topics)
Fitting 2 folds for each of 5 candidates, totalling 10 fits

--- Diabetes ---
Model Perplexity: 17.41
Silhouette Score: 0.10
Topic 1: diabetes hba1c hypertension metformin type_2_diabetes insulin fasting_glucose atenolol hypoglycemia glipizide
Topic 2: hypertension diabetes blood_pressure diabetes_screening cardiovascular_risk_assessment alcoholism weight_management_counseling amlodipine medication_adherence_counseling cholesterol_screening
Fitting 2 folds for each of 5 candidates, totalling 10 fits

--- Hypertension ---
Model Perplexity: 14.13
Silhouette Score: 0.56
Topic 1: blood_pressure amlodipine sleep_hygiene_education losartan ast alt thiamine folic_acid hepatic_steatosis negated_hypertension
Topic 2: diabetes hba1c cardiovascular_risk_assessment metformin fasting_glucose atenolol type_2_diabetes negated_hypertension negated_amlodipine chronic_kidney_disease
Topic 3: alcoholism weight_management_counseling diabetes_screening medication_adherence_counseling patient_education alcohol_screening hydrochlorothiazide resistant_hypertension obstructive_sleep_apnea left_ventricular_hypertrophy
Fitting 2 folds for each of 5 candidates, totalling 10 fits

--- Alcohol ---
Model Perplexity: 20.55
Silhouette Score: 0.10
Topic 1: alcoholism alcohol_use_disorder alt ast motivational_interviewing heavy_drinking thiamine folic_acid disulfiram acamprosate
Topic 2: hypertension alcoholism diabetes_screening weight_management_counseling blood_pressure patient_education cholesterol_screening medication_adherence_counseling ecg atrial_fibrillation
Topic 3: hypertension alcohol_screening hydrochlorothiazide blood_pressure amlodipine losartan left_ventricular_hypertrophy echocardiogram chronic_kidney_disease proteinuria
Topic 4: diabetes hba1c hypertension hypoglycemia glipizide metformin fasting_glucose alcoholism type_2_diabetes insulin
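Several topics in the output share top words (for example, both diabetes topics contain `diabetes` and `hypertension`), which fits the low silhouette scores reported for the diabetes and alcohol models. One simple way to quantify that overlap is the Jaccard similarity between the top-word sets of each pair of topics. The sketch below is an illustrative diagnostic only; `jaccard` and `pairwise_topic_overlap` are hypothetical helpers, not methods of `ClinicalTopicModel`.

```python
from itertools import combinations


def jaccard(words_a, words_b):
    """Jaccard similarity between two lists of top words."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_topic_overlap(topics):
    """Map each pair of 1-based topic indices to its Jaccard similarity."""
    return {
        (i + 1, j + 1): jaccard(topics[i], topics[j])
        for i, j in combinations(range(len(topics)), 2)
    }


# Top words of the two diabetes topics, copied from the output above.
diabetes_topics = [
    "diabetes hba1c hypertension metformin type_2_diabetes insulin "
    "fasting_glucose atenolol hypoglycemia glipizide".split(),
    "hypertension diabetes blood_pressure diabetes_screening "
    "cardiovascular_risk_assessment alcoholism weight_management_counseling "
    "amlodipine medication_adherence_counseling cholesterol_screening".split(),
]

for pair, score in pairwise_topic_overlap(diabetes_topics).items():
    print(f"Topics {pair}: Jaccard = {score:.2f}")  # Topics (1, 2): Jaccard = 0.11
```

In the notebook, the same helper could be applied to each non-empty entry of `all_topics`, since the loop stores one list of topic word lists per condition.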
if is_step_enabled('nlp_topic_modeling'):
    # df_cond still refers to the last condition processed in the loop above
    if not df_cond.empty:
        # Visualize clinical entities for up to 10 notes after topic modeling
        for i in range(min(10, len(df_cond))):
            sample_note = df_cond['PatientNotes'].iloc[i]
            print(f'Visualizing clinical entities for a note {i+1}:')
            model.plot_medspacy_ents(sample_note)
        # Visualize the last sampled note once more
        print('Visualizing clinical entities for a note:')
        model.plot_medspacy_ents(sample_note)
    else:
        print('No notes available for visualization.')
    # Plot word clouds for each condition after the loop
    for cond in conditions:
        plotter.plot_wordclouds(model.model, model.vectorizer, cond)
Visualizing clinical entities for a note 1:
Visualizing clinical entities for a note 2:
Visualizing clinical entities for a note 3:
Visualizing clinical entities for a note 4:
Visualizing clinical entities for a note 5:
Visualizing clinical entities for a note 6:
Visualizing clinical entities for a note 7:
Visualizing clinical entities for a note 8:
Visualizing clinical entities for a note 9:
Visualizing clinical entities for a note 10:
Visualizing clinical entities for a note:
if is_step_enabled('nlp_topic_modeling'):
    # Plot perplexity and silhouette score using plotter
    plotter.plot_bar(conditions, perplexities, title='LDA Model Perplexity by Condition (MedSpaCy)', ylabel='Perplexity')
    plotter.plot_bar(conditions, silhouette_scores, title='LDA Silhouette Score by Condition (MedSpaCy)', ylabel='Silhouette Score')
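The topic-modeling loop appends `None` to `perplexities` and `silhouette_scores` whenever a condition is skipped, and how `plotter.plot_bar` treats `None` is not shown in this notebook. A minimal sketch of one way to guard the plots is to drop the missing entries first. `drop_missing` is a hypothetical helper, and the skipped condition in the example is invented for illustration.

```python
def drop_missing(labels, values):
    """Keep only the (label, value) pairs whose value is not None."""
    kept = [(label, value) for label, value in zip(labels, values) if value is not None]
    kept_labels = [label for label, _ in kept]
    kept_values = [value for _, value in kept]
    return kept_labels, kept_values


# Illustrative input: the second condition is treated as skipped (hypothetical).
labels, values = drop_missing(['diabetes', 'hypertension', 'alcohol'], [17.41, None, 20.55])
print(labels)  # ['diabetes', 'alcohol']
print(values)  # [17.41, 20.55]
```

The filtered lists could then be passed to `plotter.plot_bar` in place of `conditions` and `perplexities`, so that a skipped condition is omitted from the chart rather than drawn as a zero or causing an error.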
References¶
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
- Cambria, E., Schuller, B., Xia, Y., & Havasi, C. (2013). New Avenues in Opinion Mining and Sentiment Analysis. IEEE Intelligent Systems, 28(2), 15–21.
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). https://doi.org/10.1145/2939672.2939785
- Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
- Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202.
- MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297).
- Reynolds, D. A. (2009). Gaussian Mixture Models. In Encyclopedia of Biometrics (pp. 659–663). Springer.
- Ketchen, D. J., & Shook, C. L. (1996). The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal, 17(6), 441–458. https://doi.org/10.1002/(SICI)1097-0266(199606)
- MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297).
- Ketchen, D. J., & Shook, C. L. (1996). The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal, 17(6), 441–458. https://doi.org/10.1002/(SICI)1097-0266(199606)
- Reynolds, D. A. (2009). Gaussian Mixture Models. In Encyclopedia of Biometrics (pp. 659–663). Springer.
